wavedl 1.4.0__tar.gz → 1.4.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {wavedl-1.4.0/src/wavedl.egg-info → wavedl-1.4.2}/PKG-INFO +26 -6
- {wavedl-1.4.0 → wavedl-1.4.2}/README.md +25 -5
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/__init__.py +1 -1
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/hpc.py +22 -4
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/hpo.py +46 -19
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/_template.py +28 -40
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/base.py +49 -1
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/train.py +65 -27
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/config.py +88 -2
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/data.py +79 -2
- {wavedl-1.4.0 → wavedl-1.4.2/src/wavedl.egg-info}/PKG-INFO +26 -6
- {wavedl-1.4.0 → wavedl-1.4.2}/LICENSE +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/pyproject.toml +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/setup.cfg +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/__init__.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/cnn.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/convnext.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/densenet.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/efficientnet.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/efficientnetv2.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/mobilenetv3.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/registry.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/regnet.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/resnet.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/resnet3d.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/swin.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/tcn.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/unet.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/vit.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/test.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/__init__.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/cross_validation.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/distributed.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/losses.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/metrics.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/optimizers.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/schedulers.py +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl.egg-info/SOURCES.txt +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl.egg-info/dependency_links.txt +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl.egg-info/entry_points.txt +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl.egg-info/requires.txt +0 -0
- {wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl.egg-info/top_level.txt +0 -0
**{wavedl-1.4.0/src/wavedl.egg-info → wavedl-1.4.2}/PKG-INFO**

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: wavedl
-Version: 1.4.0
+Version: 1.4.2
 Summary: A Scalable Deep Learning Framework for Wave-Based Inverse Problems
 Author: Ductho Le
 License: MIT
@@ -502,12 +502,32 @@ WaveDL/
 
 | Argument | Default | Description |
 |----------|---------|-------------|
-| `--compile` | `False` | Enable `torch.compile` |
+| `--compile` | `False` | Enable `torch.compile` (recommended for long runs) |
 | `--precision` | `bf16` | Mixed precision mode (`bf16`, `fp16`, `no`) |
+| `--workers` | `-1` | DataLoader workers per GPU (-1=auto, up to 16) |
 | `--wandb` | `False` | Enable W&B logging |
+| `--wandb_watch` | `False` | Enable W&B gradient watching (adds overhead) |
 | `--project_name` | `DL-Training` | W&B project name |
 | `--run_name` | `None` | W&B run name (auto-generated if not set) |
 
+**Automatic GPU Optimizations:**
+
+WaveDL automatically enables performance optimizations for modern GPUs:
+
+| Optimization | Effect | GPU Support |
+|--------------|--------|-------------|
+| **TF32 precision** | ~2x speedup for float32 matmul | A100, H100 (Ampere+) |
+| **cuDNN benchmark** | Auto-tuned convolutions | All NVIDIA GPUs |
+| **Worker scaling** | Up to 16 workers per GPU | All systems |
+
+> [!NOTE]
+> These optimizations are **backward compatible** — they have no effect on older GPUs (V100, T4, GTX) or CPU-only systems. No configuration needed.
+
+**HPC Best Practices:**
+- Stage data to `$SLURM_TMPDIR` (local NVMe) for maximum I/O throughput
+- Use `--compile` for training runs > 50 epochs
+- Increase `--workers` manually if auto-detection is suboptimal
+
 </details>
 
 <details>
@@ -690,7 +710,7 @@ python -m wavedl.hpo --data_path train.npz --models cnn --n_trials 50
 python -m wavedl.hpo --data_path train.npz --models cnn resnet18 resnet50 vit_small densenet121 --n_trials 200
 ```
 
-**
+**Train with best parameters**
 
 After HPO completes, it prints the optimal command:
 ```bash
@@ -837,7 +857,7 @@ import numpy as np
 X = np.random.randn(1000, 256, 256).astype(np.float32)
 y = np.random.randn(1000, 5).astype(np.float32)
 
-np.savez('test_data.npz',
+np.savez('test_data.npz', input_test=X, output_test=y)
 ```
 
 </details>
@@ -849,7 +869,7 @@ np.savez('test_data.npz', input_train=X, output_train=y)
 import numpy as np
 
 data = np.load('train_data.npz')
-assert data['input_train'].ndim
+assert data['input_train'].ndim >= 2, "Input must be at least 2D: (N, ...) "
 assert data['output_train'].ndim == 2, "Output must be 2D: (N, T)"
 assert len(data['input_train']) == len(data['output_train']), "Sample mismatch"
 
@@ -1004,7 +1024,7 @@ Beyond the material characterization example above, the WaveDL pipeline can be a
 | Resource | Description |
 |----------|-------------|
 | Technical Paper | In-depth framework description *(coming soon)* |
-| [`_template.py`](models/_template.py) | Template for
+| [`_template.py`](src/wavedl/models/_template.py) | Template for custom architectures |
 
 ---
 
````
**{wavedl-1.4.0 → wavedl-1.4.2}/README.md**

````diff
@@ -457,12 +457,32 @@ WaveDL/
 
 | Argument | Default | Description |
 |----------|---------|-------------|
-| `--compile` | `False` | Enable `torch.compile` |
+| `--compile` | `False` | Enable `torch.compile` (recommended for long runs) |
 | `--precision` | `bf16` | Mixed precision mode (`bf16`, `fp16`, `no`) |
+| `--workers` | `-1` | DataLoader workers per GPU (-1=auto, up to 16) |
 | `--wandb` | `False` | Enable W&B logging |
+| `--wandb_watch` | `False` | Enable W&B gradient watching (adds overhead) |
 | `--project_name` | `DL-Training` | W&B project name |
 | `--run_name` | `None` | W&B run name (auto-generated if not set) |
 
+**Automatic GPU Optimizations:**
+
+WaveDL automatically enables performance optimizations for modern GPUs:
+
+| Optimization | Effect | GPU Support |
+|--------------|--------|-------------|
+| **TF32 precision** | ~2x speedup for float32 matmul | A100, H100 (Ampere+) |
+| **cuDNN benchmark** | Auto-tuned convolutions | All NVIDIA GPUs |
+| **Worker scaling** | Up to 16 workers per GPU | All systems |
+
+> [!NOTE]
+> These optimizations are **backward compatible** — they have no effect on older GPUs (V100, T4, GTX) or CPU-only systems. No configuration needed.
+
+**HPC Best Practices:**
+- Stage data to `$SLURM_TMPDIR` (local NVMe) for maximum I/O throughput
+- Use `--compile` for training runs > 50 epochs
+- Increase `--workers` manually if auto-detection is suboptimal
+
 </details>
 
 <details>
@@ -645,7 +665,7 @@ python -m wavedl.hpo --data_path train.npz --models cnn --n_trials 50
 python -m wavedl.hpo --data_path train.npz --models cnn resnet18 resnet50 vit_small densenet121 --n_trials 200
 ```
 
-**
+**Train with best parameters**
 
 After HPO completes, it prints the optimal command:
 ```bash
@@ -792,7 +812,7 @@ import numpy as np
 X = np.random.randn(1000, 256, 256).astype(np.float32)
 y = np.random.randn(1000, 5).astype(np.float32)
 
-np.savez('test_data.npz',
+np.savez('test_data.npz', input_test=X, output_test=y)
 ```
 
 </details>
@@ -804,7 +824,7 @@ np.savez('test_data.npz', input_train=X, output_train=y)
 import numpy as np
 
 data = np.load('train_data.npz')
-assert data['input_train'].ndim
+assert data['input_train'].ndim >= 2, "Input must be at least 2D: (N, ...) "
 assert data['output_train'].ndim == 2, "Output must be 2D: (N, T)"
 assert len(data['input_train']) == len(data['output_train']), "Sample mismatch"
 
@@ -959,7 +979,7 @@ Beyond the material characterization example above, the WaveDL pipeline can be a
 | Resource | Description |
 |----------|-------------|
 | Technical Paper | In-depth framework description *(coming soon)* |
-| [`_template.py`](models/_template.py) | Template for
+| [`_template.py`](src/wavedl/models/_template.py) | Template for custom architectures |
 
 ---
 
````
**{wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/hpc.py**

```diff
@@ -130,6 +130,18 @@ Environment Variables:
         default=0,
         help="Rank of this machine in multi-node setup (default: 0)",
     )
+    parser.add_argument(
+        "--main_process_ip",
+        type=str,
+        default=None,
+        help="IP address of the main process for multi-node training",
+    )
+    parser.add_argument(
+        "--main_process_port",
+        type=int,
+        default=None,
+        help="Port for multi-node communication (default: accelerate auto-selects)",
+    )
     parser.add_argument(
         "--mixed_precision",
         type=str,
@@ -207,12 +219,18 @@ def main() -> int:
         "launch",
         f"--num_processes={num_gpus}",
         f"--num_machines={args.num_machines}",
-        "--machine_rank=
+        f"--machine_rank={args.machine_rank}",
         f"--mixed_precision={args.mixed_precision}",
         f"--dynamo_backend={args.dynamo_backend}",
-
-
-
+    ]
+
+    # Add multi-node networking args if specified (required for some clusters)
+    if args.main_process_ip:
+        cmd.append(f"--main_process_ip={args.main_process_ip}")
+    if args.main_process_port:
+        cmd.append(f"--main_process_port={args.main_process_port}")
+
+    cmd += ["-m", "wavedl.train"] + train_args
 
     # Create output directory if specified
     for i, arg in enumerate(train_args):
```
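To make the command-assembly change above concrete, here is a minimal standalone sketch of the same logic with hypothetical values (2 nodes, 4 GPUs, a made-up head-node IP) — an illustration of the pattern, not the package's actual code:

```python
# Sketch: how hpc.py's new flags compose an accelerate launch command.
# The values below (IP, port, counts) are hypothetical.
num_gpus, num_machines, machine_rank = 4, 2, 0
main_process_ip, main_process_port = "10.0.0.1", 29500  # None unless the user passes them

cmd = [
    "accelerate", "launch",
    f"--num_processes={num_gpus}",
    f"--num_machines={num_machines}",
    f"--machine_rank={machine_rank}",
    "--mixed_precision=bf16",
]
# Networking args are appended only when specified, mirroring the diff above
if main_process_ip:
    cmd.append(f"--main_process_ip={main_process_ip}")
if main_process_port:
    cmd.append(f"--main_process_port={main_process_port}")
cmd += ["-m", "wavedl.train", "--model", "cnn"]

print(" ".join(cmd))
```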
**{wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/hpo.py**

```diff
@@ -145,6 +145,7 @@ def create_objective(args):
         # Use temporary directory for trial output
         with tempfile.TemporaryDirectory() as tmpdir:
             cmd.extend(["--output_dir", tmpdir])
+            history_file = Path(tmpdir) / "training_history.csv"
 
             # Run training
             try:
@@ -156,29 +157,55 @@ def create_objective(args):
                     cwd=Path(__file__).parent,
                 )
 
-                #
-                # Look for "Best val_loss: X.XXXX" in stdout
+                # Read best val_loss from training_history.csv (reliable machine-readable)
                 val_loss = None
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+                if history_file.exists():
+                    try:
+                        import csv
+
+                        with open(history_file) as f:
+                            reader = csv.DictReader(f)
+                            val_losses = []
+                            for row in reader:
+                                if "val_loss" in row:
+                                    try:
+                                        val_losses.append(float(row["val_loss"]))
+                                    except (ValueError, TypeError):
+                                        pass
+                            if val_losses:
+                                val_loss = min(val_losses)  # Best (minimum) val_loss
+                    except Exception as e:
+                        print(f"Trial {trial.number}: Error reading history: {e}")
+
+                if val_loss is None:
+                    # Fallback: parse stdout for training log format
+                    # Pattern: "epoch | train_loss | val_loss | ..."
+                    # Use regex to avoid false positives from unrelated lines
+                    import re
+
+                    # Match lines like: "  42 | 0.0123 | 0.0156 | ..."
+                    log_pattern = re.compile(
+                        r"^\s*\d+\s*\|\s*[\d.]+\s*\|\s*([\d.]+)\s*\|"
+                    )
+                    val_losses_stdout = []
+                    for line in result.stdout.split("\n"):
+                        match = log_pattern.match(line)
+                        if match:
+                            try:
+                                val_losses_stdout.append(float(match.group(1)))
+                            except ValueError:
+                                continue
+                    if val_losses_stdout:
+                        val_loss = min(val_losses_stdout)
 
                 if val_loss is None:
                     # Training failed or no loss found
-                    print(f"Trial {trial.number}: Training failed")
+                    print(f"Trial {trial.number}: Training failed (no val_loss found)")
+                    if result.returncode != 0:
+                        # Show last few lines of stderr for debugging
+                        stderr_lines = result.stderr.strip().split("\n")[-3:]
+                        for line in stderr_lines:
+                            print(f"  stderr: {line}")
                     return float("inf")
 
                 print(f"Trial {trial.number}: val_loss={val_loss:.6f}")
```
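The CSV path in the new objective is easy to test in isolation. A self-contained sketch of the same "best val_loss" extraction, assuming only that `training_history.csv` has a `val_loss` column as the diff implies:

```python
import csv
from pathlib import Path


def best_val_loss(history_file: Path) -> float | None:
    """Return the minimum val_loss recorded in a training_history.csv, or None."""
    if not history_file.exists():
        return None
    losses = []
    with open(history_file) as f:
        for row in csv.DictReader(f):
            try:
                losses.append(float(row["val_loss"]))
            except (KeyError, ValueError, TypeError):
                continue  # skip malformed or missing entries, as the trial code does
    return min(losses) if losses else None
```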
**{wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/_template.py**

```diff
@@ -1,23 +1,26 @@
 """
-Model Template for
-
+Model Template for Custom Architectures
+========================================
 
-Copy this file and modify to add
+Copy this file and modify to add custom model architectures to WaveDL.
 The model will be automatically registered and available via --model flag.
 
-
-1. Copy this file to
-2. Rename the class and update @register_model("
-3. Implement
-4.
-
-
+Quick Start:
+    1. Copy this file to your project: cp _template.py my_model.py
+    2. Rename the class and update @register_model("my_model")
+    3. Implement your architecture in __init__ and forward
+    4. Train: wavedl-train --import my_model --model my_model --data_path data.npz
+
+Requirements (your model MUST):
+    1. Inherit from BaseModel
+    2. Accept (in_shape, out_size, **kwargs) in __init__
+    3. Return tensor of shape (batch, out_size) from forward()
+
+See README.md "Adding Custom Models" section for more details.
 
 Author: Ductho Le (ductho.le@outlook.com)
 """
 
-from typing import Any
-
 import torch
 import torch.nn as nn
@@ -25,7 +28,7 @@ from wavedl.models.base import BaseModel
 
 
 # Uncomment the decorator to register this model
-# @register_model("
+# @register_model("my_model")
 class TemplateModel(BaseModel):
     """
     Template Model Architecture.
@@ -34,14 +37,16 @@ class TemplateModel(BaseModel):
     The first line will appear in --list_models output.
 
     Args:
-        in_shape: Input spatial dimensions (
-
+        in_shape: Input spatial dimensions (auto-detected from data)
+            - 1D: (L,) for signals
+            - 2D: (H, W) for images
+            - 3D: (D, H, W) for volumes
+        out_size: Number of regression targets (auto-detected from data)
         hidden_dim: Size of hidden layers (default: 256)
-        num_layers: Number of convolutional layers (default: 4)
         dropout: Dropout rate (default: 0.1)
 
     Input Shape:
-        (B, 1,
+        (B, 1, *in_shape) - e.g., (B, 1, 64, 64) for 2D
 
     Output Shape:
         (B, out_size) - Regression predictions
@@ -49,10 +54,9 @@ class TemplateModel(BaseModel):
 
     def __init__(
         self,
-        in_shape: tuple
+        in_shape: tuple,
         out_size: int,
         hidden_dim: int = 256,
-        num_layers: int = 4,
         dropout: float = 0.1,
         **kwargs,  # Accept extra kwargs for flexibility
     ):
@@ -61,14 +65,13 @@ class TemplateModel(BaseModel):
 
         # Store hyperparameters as attributes (optional but recommended)
         self.hidden_dim = hidden_dim
-        self.num_layers = num_layers
         self.dropout_rate = dropout
 
         # =================================================================
         # BUILD YOUR ARCHITECTURE HERE
         # =================================================================
 
-        # Example: Simple CNN encoder
+        # Example: Simple CNN encoder (assumes 2D input with 1 channel)
         self.encoder = nn.Sequential(
             # Layer 1
             nn.Conv2d(1, 32, kernel_size=3, padding=1),
@@ -106,10 +109,10 @@ class TemplateModel(BaseModel):
         """
         Forward pass of the model.
 
-        REQUIRED: Must accept (B, C,
+        REQUIRED: Must accept (B, C, *spatial) and return (B, out_size)
 
         Args:
-            x: Input tensor of shape (B, 1,
+            x: Input tensor of shape (B, 1, *in_shape)
 
         Returns:
             Output tensor of shape (B, out_size)
@@ -122,35 +125,20 @@ class TemplateModel(BaseModel):
 
         return output
 
-    @classmethod
-    def get_default_config(cls) -> dict[str, Any]:
-        """
-        Return default hyperparameters for this model.
-
-        OPTIONAL: Override to provide model-specific defaults.
-        These can be used for documentation or config files.
-        """
-        return {
-            "hidden_dim": 256,
-            "num_layers": 4,
-            "dropout": 0.1,
-        }
-
 
 # =============================================================================
 # USAGE EXAMPLE
 # =============================================================================
 if __name__ == "__main__":
     # Quick test of the model
-    model = TemplateModel(in_shape=(
+    model = TemplateModel(in_shape=(64, 64), out_size=5)
 
     # Print model summary
     print(f"Model: {model.__class__.__name__}")
     print(f"Parameters: {model.count_parameters():,}")
-    print(f"Default config: {model.get_default_config()}")
 
     # Test forward pass
-    dummy_input = torch.randn(2, 1,
+    dummy_input = torch.randn(2, 1, 64, 64)
     output = model(dummy_input)
     print(f"Input shape: {dummy_input.shape}")
    print(f"Output shape: {output.shape}")
```
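Putting the template's contract together, a hypothetical minimal custom model. The `register_model` import path and the `BaseModel.__init__` signature are assumptions inferred from the template and the repo layout (`models/registry.py`), not confirmed by this diff:

```python
import torch
import torch.nn as nn

from wavedl.models.base import BaseModel
from wavedl.models.registry import register_model  # assumed import path


@register_model("tiny_mlp")  # hypothetical model name
class TinyMLP(BaseModel):
    """Tiny MLP baseline for small inputs (illustration only)."""

    def __init__(self, in_shape: tuple, out_size: int, hidden_dim: int = 128, **kwargs):
        super().__init__(in_shape, out_size)  # assumed BaseModel signature
        n_in = 1
        for d in in_shape:  # flatten (C=1, *in_shape) to a single feature vector
            n_in *= d
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_in, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # (B, out_size), as the contract requires
```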
**{wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/models/base.py**

```diff
@@ -75,13 +75,61 @@ class BaseModel(nn.Module, ABC):
         Forward pass of the model.
 
         Args:
-            x: Input tensor of shape (B, C,
+            x: Input tensor of shape (B, C, *spatial_dims)
+                - 1D: (B, C, L)
+                - 2D: (B, C, H, W)
+                - 3D: (B, C, D, H, W)
 
         Returns:
             Output tensor of shape (B, out_size)
         """
         pass
 
+    def validate_input_shape(self, x: torch.Tensor) -> None:
+        """
+        Validate input tensor shape against model's expected shape.
+
+        Call this at the start of forward() for explicit shape contract enforcement.
+        Provides clear, actionable error messages instead of cryptic Conv layer errors.
+
+        Args:
+            x: Input tensor to validate
+
+        Raises:
+            ValueError: If shape doesn't match expected dimensions
+
+        Example:
+            def forward(self, x):
+                self.validate_input_shape(x)  # Optional but recommended
+                return self.model(x)
+        """
+        expected_ndim = len(self.in_shape) + 2  # +2 for (batch, channel)
+
+        if x.ndim != expected_ndim:
+            dim_names = {
+                3: "1D (B, C, L)",
+                4: "2D (B, C, H, W)",
+                5: "3D (B, C, D, H, W)",
+            }
+            expected_name = dim_names.get(expected_ndim, f"{expected_ndim}D")
+            actual_name = dim_names.get(x.ndim, f"{x.ndim}D")
+            raise ValueError(
+                f"Input shape mismatch: model expects {expected_name} input, "
+                f"got {actual_name} with shape {tuple(x.shape)}.\n"
+                f"Expected in_shape: {self.in_shape} -> input should be (B, C, {', '.join(map(str, self.in_shape))})\n"
+                f"Hint: Check your data preprocessing - you may need to add/remove dimensions."
+            )
+
+        # Validate spatial dimensions match
+        spatial_dims = tuple(x.shape[2:])  # Skip batch and channel
+        if spatial_dims != tuple(self.in_shape):
+            raise ValueError(
+                f"Spatial dimension mismatch: model expects {self.in_shape}, "
+                f"got {spatial_dims}.\n"
+                f"Full input shape: {tuple(x.shape)} (B={x.shape[0]}, C={x.shape[1]})\n"
+                f"Hint: Ensure your data dimensions match the model's in_shape."
+            )
+
     def count_parameters(self, trainable_only: bool = True) -> int:
         """
         Count the number of parameters in the model.
```
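The two checks in `validate_input_shape` are simple enough to mirror standalone, which also shows what they catch. A minimal sketch (the function below is a hypothetical re-implementation for illustration, not the package API):

```python
import torch


def check_shape(x: torch.Tensor, in_shape: tuple) -> None:
    """Standalone mirror of BaseModel.validate_input_shape's two checks."""
    expected_ndim = len(in_shape) + 2  # + (batch, channel)
    if x.ndim != expected_ndim:
        raise ValueError(f"expected {expected_ndim}D input, got {x.ndim}D {tuple(x.shape)}")
    if tuple(x.shape[2:]) != tuple(in_shape):
        raise ValueError(f"expected spatial dims {in_shape}, got {tuple(x.shape[2:])}")


check_shape(torch.randn(8, 1, 64, 64), (64, 64))    # passes silently
# check_shape(torch.randn(8, 1, 32, 32), (64, 64))  # -> ValueError with a clear message
```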
**{wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/train.py**

```diff
@@ -12,11 +12,11 @@ A modular training framework for wave-based inverse problems and regression:
     6. Deep Observability: WandB integration with scatter analysis
 
 Usage:
-    # Recommended: Using the HPC launcher
-    wavedl-hpc --model cnn --batch_size 128 --wandb
+    # Recommended: Using the HPC launcher (handles accelerate configuration)
+    wavedl-hpc --model cnn --batch_size 128 --mixed_precision bf16 --wandb
 
-    # Or
-    accelerate launch -m wavedl.train --model cnn --batch_size 128 --
+    # Or direct training module (use --precision, not --mixed_precision)
+    accelerate launch -m wavedl.train --model cnn --batch_size 128 --precision bf16
 
     # Multi-GPU with explicit config
     wavedl-hpc --num_gpus 4 --mixed_precision bf16 --model cnn --wandb
@@ -28,9 +28,9 @@ Usage:
     wavedl-train --list_models
 
 Note:
-
-
-
+    - wavedl-hpc: Uses --mixed_precision (passed to accelerate launch)
+    - wavedl.train: Uses --precision (internal module flag)
+    Both control the same behavior; use the appropriate flag for your entry point.
 
 Author: Ductho Le (ductho.le@outlook.com)
 """
@@ -97,6 +97,18 @@ warnings.filterwarnings("ignore", category=DeprecationWarning)
 warnings.filterwarnings("ignore", module="pydantic")
 warnings.filterwarnings("ignore", message=".*UnsupportedFieldAttributeWarning.*")
 
+# ==============================================================================
+# GPU PERFORMANCE OPTIMIZATIONS (Ampere/Hopper: A100, H100)
+# ==============================================================================
+# Enable TF32 for faster matmul (safe precision for training, ~2x speedup)
+torch.backends.cuda.matmul.allow_tf32 = True
+torch.backends.cudnn.allow_tf32 = True
+torch.set_float32_matmul_precision("high")  # Use TF32 for float32 ops
+
+# Enable cuDNN autotuning for fixed-size inputs (CNN-like models benefit most)
+# Note: First few batches may be slower due to benchmarking
+torch.backends.cudnn.benchmark = True
+
 
 # ==============================================================================
 # ARGUMENT PARSING
@@ -298,11 +310,24 @@ def parse_args() -> argparse.Namespace:
         choices=["bf16", "fp16", "no"],
         help="Mixed precision mode",
     )
+    # Alias for consistency with wavedl-hpc (--mixed_precision)
+    parser.add_argument(
+        "--mixed_precision",
+        dest="precision",
+        type=str,
+        choices=["bf16", "fp16", "no"],
+        help=argparse.SUPPRESS,  # Hidden: use --precision instead
+    )
 
     # Logging
     parser.add_argument(
         "--wandb", action="store_true", help="Enable Weights & Biases logging"
     )
+    parser.add_argument(
+        "--wandb_watch",
+        action="store_true",
+        help="Enable WandB gradient watching (adds overhead, useful for debugging)",
+    )
     parser.add_argument(
         "--project_name", type=str, default="DL-Training", help="WandB project name"
     )
@@ -467,8 +492,8 @@ def main():
         if _cv_handle is not None and hasattr(_cv_handle, "close"):
             try:
                 _cv_handle.close()
-            except Exception:
-
+            except Exception as e:
+                logging.debug(f"Failed to close CV data handle: {e}")
         return
 
     # ==========================================================================
@@ -496,9 +521,9 @@ def main():
     if args.workers < 0:
         cpu_count = os.cpu_count() or 4
         num_gpus = accelerator.num_processes
-        # Heuristic: 4-
-        #
-        args.workers = min(
+        # Heuristic: 4-16 workers per GPU, bounded by available CPU cores
+        # Increased cap from 8 to 16 for high-throughput GPUs (H100, A100)
+        args.workers = min(16, max(2, (cpu_count - 2) // num_gpus))
         if accelerator.is_main_process:
             logger.info(
                 f"⚙️ Auto-detected workers: {args.workers} per GPU "
@@ -544,9 +569,15 @@ def main():
     )
     logger.info(f"   Model Size: {param_info['total_mb']:.2f} MB")
 
-    # Optional WandB model watching
-    if
+    # Optional WandB model watching (opt-in due to overhead on large models)
+    if (
+        args.wandb
+        and args.wandb_watch
+        and WANDB_AVAILABLE
+        and accelerator.is_main_process
+    ):
         wandb.watch(model, log="gradients", log_freq=100)
+        logger.info("   📊 WandB gradient watching enabled")
 
     # Torch 2.0 compilation (requires compatible Triton on GPU)
     if args.compile:
@@ -820,7 +851,7 @@ def main():
         val_mae_sum = torch.zeros(out_dim, device=accelerator.device)
         val_samples = 0
 
-        # Accumulate predictions locally
+        # Accumulate predictions locally ON CPU to prevent GPU OOM
         local_preds = []
         local_targets = []
@@ -836,17 +867,19 @@ def main():
             mae_batch = torch.abs((pred - y) * phys_scale).sum(dim=0)
             val_mae_sum += mae_batch
 
-            # Store
-            local_preds.append(pred)
-            local_targets.append(y)
+            # Store on CPU (critical for large val sets)
+            local_preds.append(pred.detach().cpu())
+            local_targets.append(y.detach().cpu())
+
+        # Concatenate locally on CPU (no GPU memory spike)
+        cpu_preds = torch.cat(local_preds)
+        cpu_targets = torch.cat(local_targets)
 
-        #
-
-
-        all_preds = accelerator.gather_for_metrics(all_local_preds)
-        all_targets = accelerator.gather_for_metrics(all_local_targets)
+        # Gather to rank 0 only via gather_object (avoids all-gather to every rank)
+        # gather_object returns list of objects from each rank: [(preds0, targs0), (preds1, targs1), ...]
+        gathered = accelerator.gather_object((cpu_preds, cpu_targets))
 
-        # Synchronize validation metrics
+        # Synchronize validation metrics (scalars only - efficient)
         val_loss_scalar = val_loss_sum.item()
         val_metrics = torch.cat(
             [
@@ -869,9 +902,14 @@ def main():
 
         # ==================== LOGGING & CHECKPOINTING ====================
         if accelerator.is_main_process:
-            #
-
-
+            # Concatenate gathered tensors from all ranks (only on rank 0)
+            # gathered is list of tuples: [(preds_rank0, targs_rank0), (preds_rank1, targs_rank1), ...]
+            all_preds = torch.cat([item[0] for item in gathered])
+            all_targets = torch.cat([item[1] for item in gathered])
+
+            # Scientific metrics - cast to float32 before numpy
+            y_pred = all_preds.float().numpy()
+            y_true = all_targets.float().numpy()
 
             # Trim DDP padding
             real_len = len(val_dl.dataset)
```
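Two of the train.py changes above are worth unpacking. The worker heuristic is plain arithmetic: with, say, 64 CPU cores and 4 GPUs, `min(16, max(2, (64 - 2) // 4))` gives 15 workers per GPU. And the hidden `--mixed_precision` alias works purely through argparse's `dest` mechanism — both flags write to `args.precision`, and the alias deliberately omits a `default` so it cannot clobber `--precision`'s. A self-contained sketch using only standard argparse (no WaveDL imports):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--precision", type=str, default="bf16",
                    choices=["bf16", "fp16", "no"])
# Alias with the same dest: it silently feeds the canonical attribute.
# No default here, so the canonical flag's default "bf16" is preserved.
parser.add_argument("--mixed_precision", dest="precision",
                    choices=["bf16", "fp16", "no"], help=argparse.SUPPRESS)

print(parser.parse_args([]).precision)                              # bf16
print(parser.parse_args(["--precision", "fp16"]).precision)         # fp16
print(parser.parse_args(["--mixed_precision", "fp16"]).precision)   # fp16
```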
**{wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/config.py**

```diff
@@ -116,7 +116,13 @@ def merge_config_with_args(
     """
     # Get parser defaults to detect which args were explicitly set by user
     if parser is not None:
-
+        # Safe extraction: iterate actions instead of parse_args([])
+        # This avoids failures if required arguments are added later
+        defaults = {
+            action.dest: action.default
+            for action in parser._actions
+            if action.dest != "help"
+        }
     else:
         # Fallback: reconstruct defaults from known patterns
         # This works because argparse stores actual values, and we compare
@@ -141,6 +147,9 @@ def merge_config_with_args(
             setattr(args, key, value)
         elif not ignore_unknown:
             logging.warning(f"Unknown config key: {key}")
+        else:
+            # Even in ignore_unknown mode, log for discoverability
+            logging.debug(f"Config key '{key}' ignored: not a valid argument")
 
     return args
 
@@ -188,12 +197,15 @@ def save_config(
     return str(output_path)
 
 
-def validate_config(config: dict[str, Any]) -> list[str]:
+def validate_config(
+    config: dict[str, Any], known_keys: list[str] | None = None
+) -> list[str]:
     """
     Validate configuration values against known options.
 
     Args:
         config: Configuration dictionary
+        known_keys: Optional list of valid keys (if None, uses defaults from parser args)
 
     Returns:
         List of warning messages (empty if valid)
@@ -229,9 +241,83 @@ def validate_config(config: dict[str, Any]) -> list[str]:
     for key, (min_val, max_val, msg) in numeric_checks.items():
         if key in config:
             val = config[key]
+            # Type check: ensure value is numeric before comparison
+            if not isinstance(val, (int, float)):
+                warnings.append(
+                    f"Invalid type for '{key}': expected number, got {type(val).__name__} ({val!r})"
+                )
+                continue
             if not (min_val <= val <= max_val):
                 warnings.append(f"{msg}: got {val}")
 
+    # Check for unknown/unrecognized keys (helps catch typos)
+    # Default known keys based on common training arguments
+    default_known_keys = {
+        # Model
+        "model",
+        "import_modules",
+        # Hyperparameters
+        "batch_size",
+        "lr",
+        "epochs",
+        "patience",
+        "weight_decay",
+        "grad_clip",
+        # Loss
+        "loss",
+        "huber_delta",
+        "loss_weights",
+        # Optimizer
+        "optimizer",
+        "momentum",
+        "nesterov",
+        "betas",
+        # Scheduler
+        "scheduler",
+        "scheduler_patience",
+        "min_lr",
+        "scheduler_factor",
+        "warmup_epochs",
+        "step_size",
+        "milestones",
+        # Data
+        "data_path",
+        "workers",
+        "seed",
+        "single_channel",
+        # Cross-validation
+        "cv",
+        "cv_stratify",
+        "cv_bins",
+        # Checkpointing
+        "resume",
+        "save_every",
+        "output_dir",
+        "fresh",
+        # Performance
+        "compile",
+        "precision",
+        "mixed_precision",
+        # Logging
+        "wandb",
+        "wandb_watch",
+        "project_name",
+        "run_name",
+        # Config
+        "config",
+        "list_models",
+        # Metadata (internal)
+        "_metadata",
+    }
+
+    check_keys = set(known_keys) if known_keys else default_known_keys
+
+    for key in config:
+        if key not in check_keys:
+            warnings.append(
+                f"Unknown config key: '{key}' - check for typos or see wavedl-train --help"
+            )
+
     return warnings
 
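```

A quick usage sketch of the extended `validate_config`: the new unknown-key check turns a silent typo into a visible warning. This assumes only the signature and message format shown in the diff above:

```python
from wavedl.utils.config import validate_config  # module path per this package

config = {"model": "cnn", "batch_szie": 64}  # note the typo in batch_size
for warning in validate_config(config):
    print(warning)
# -> Unknown config key: 'batch_szie' - check for typos or see wavedl-train --help
```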
**{wavedl-1.4.0 → wavedl-1.4.2}/src/wavedl/utils/data.py**

```diff
@@ -735,7 +735,7 @@ def load_test_data(
     try:
         inp, outp = source.load(path)
     except KeyError:
-        # Try with just inputs if outputs not found
+        # Try with just inputs if outputs not found (inference-only mode)
         if format == "npz":
             data = np.load(path, allow_pickle=True)
             keys = list(data.keys())
@@ -751,6 +751,54 @@ def load_test_data(
                 )
             out_key = DataSource._find_key(keys, custom_output_keys)
             outp = data[out_key] if out_key else None
+        elif format == "hdf5":
+            # HDF5: input-only loading for inference
+            with h5py.File(path, "r") as f:
+                keys = list(f.keys())
+                inp_key = DataSource._find_key(keys, custom_input_keys)
+                if inp_key is None:
+                    raise KeyError(
+                        f"Input key not found. Tried: {custom_input_keys}. Found: {keys}"
+                    )
+                # Check size - load_test_data is eager, large files should use DataLoader
+                n_samples = f[inp_key].shape[0]
+                if n_samples > 100000:
+                    raise ValueError(
+                        f"Dataset has {n_samples:,} samples. load_test_data() loads "
+                        f"everything into RAM which may cause OOM. For large inference "
+                        f"sets, use a DataLoader with HDF5Source.load_mmap() instead."
+                    )
+                inp = f[inp_key][:]
+                out_key = DataSource._find_key(keys, custom_output_keys)
+                outp = f[out_key][:] if out_key else None
+        elif format == "mat":
+            # MAT v7.3: input-only loading with proper sparse handling
+            mat_source = MATSource()
+            with h5py.File(path, "r") as f:
+                keys = list(f.keys())
+                inp_key = DataSource._find_key(keys, custom_input_keys)
+                if inp_key is None:
+                    raise KeyError(
+                        f"Input key not found. Tried: {custom_input_keys}. Found: {keys}"
+                    )
+                # Check size - load_test_data is eager, large files should use DataLoader
+                n_samples = f[inp_key].shape[-1]  # MAT is transposed
+                if n_samples > 100000:
+                    raise ValueError(
+                        f"Dataset has {n_samples:,} samples. load_test_data() loads "
+                        f"everything into RAM which may cause OOM. For large inference "
+                        f"sets, use a DataLoader with MATSource.load_mmap() instead."
+                    )
+                # Use _load_dataset for sparse support and proper transpose
+                inp = mat_source._load_dataset(f, inp_key)
+                out_key = DataSource._find_key(keys, custom_output_keys)
+                if out_key:
+                    outp = mat_source._load_dataset(f, out_key)
+                    # Handle 1D outputs that become (1, N) after transpose
+                    if outp.ndim == 2 and outp.shape[0] == 1:
+                        outp = outp.T
+                else:
+                    outp = None
         else:
             raise
 
@@ -949,6 +997,15 @@ def prepare_data(
             with open(META_FILE, "rb") as f:
                 meta = pickle.load(f)
             cached_data_path = meta.get("data_path", None)
+            cached_file_size = meta.get("file_size", None)
+            cached_file_mtime = meta.get("file_mtime", None)
+
+            # Get current file stats
+            current_stats = os.stat(args.data_path)
+            current_size = current_stats.st_size
+            current_mtime = current_stats.st_mtime
+
+            # Check if data path changed
             if cached_data_path != os.path.abspath(args.data_path):
                 if accelerator.is_main_process:
                     logger.warning(
@@ -958,6 +1015,23 @@ def prepare_data(
                         f"   Invalidating cache and regenerating..."
                     )
                 cache_exists = False
+            # Check if file was modified (size or mtime changed)
+            elif cached_file_size is not None and cached_file_size != current_size:
+                if accelerator.is_main_process:
+                    logger.warning(
+                        f"⚠️ Data file size changed!\n"
+                        f"   Cached size: {cached_file_size:,} bytes\n"
+                        f"   Current size: {current_size:,} bytes\n"
+                        f"   Invalidating cache and regenerating..."
+                    )
+                cache_exists = False
+            elif cached_file_mtime is not None and cached_file_mtime != current_mtime:
+                if accelerator.is_main_process:
+                    logger.warning(
+                        "⚠️ Data file was modified!\n"
+                        "   Cache may be stale, regenerating..."
+                    )
+                cache_exists = False
         except Exception:
             cache_exists = False
 
@@ -1053,13 +1127,16 @@ def prepare_data(
             f"   Shape Detected: {full_shape} [{dim_type}] | Output Dim: {out_dim}"
         )
 
-        # Save metadata (including data path for cache validation)
+        # Save metadata (including data path, size, mtime for cache validation)
+        file_stats = os.stat(args.data_path)
         with open(META_FILE, "wb") as f:
            pickle.dump(
                 {
                     "shape": full_shape,
                     "out_dim": out_dim,
                     "data_path": os.path.abspath(args.data_path),
+                    "file_size": file_stats.st_size,
+                    "file_mtime": file_stats.st_mtime,
                 },
                 f,
             )
```
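The cache-invalidation change above keys staleness on file size and mtime. A standalone sketch of the same test using only the stdlib — a hypothetical helper for illustration, mirroring the checks `prepare_data` now performs:

```python
import os


def cache_is_stale(data_path: str, meta: dict) -> bool:
    """Mirror of prepare_data's path/size/mtime cache checks."""
    stats = os.stat(data_path)
    if meta.get("data_path") != os.path.abspath(data_path):
        return True   # different dataset entirely
    if meta.get("file_size") is not None and meta["file_size"] != stats.st_size:
        return True   # file grew or shrank since caching
    if meta.get("file_mtime") is not None and meta["file_mtime"] != stats.st_mtime:
        return True   # file touched since caching
    return False
```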
**{wavedl-1.4.0 → wavedl-1.4.2/src/wavedl.egg-info}/PKG-INFO**

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: wavedl
-Version: 1.4.0
+Version: 1.4.2
 Summary: A Scalable Deep Learning Framework for Wave-Based Inverse Problems
 Author: Ductho Le
 License: MIT
@@ -502,12 +502,32 @@ WaveDL/
 
 | Argument | Default | Description |
 |----------|---------|-------------|
-| `--compile` | `False` | Enable `torch.compile` |
+| `--compile` | `False` | Enable `torch.compile` (recommended for long runs) |
 | `--precision` | `bf16` | Mixed precision mode (`bf16`, `fp16`, `no`) |
+| `--workers` | `-1` | DataLoader workers per GPU (-1=auto, up to 16) |
 | `--wandb` | `False` | Enable W&B logging |
+| `--wandb_watch` | `False` | Enable W&B gradient watching (adds overhead) |
 | `--project_name` | `DL-Training` | W&B project name |
 | `--run_name` | `None` | W&B run name (auto-generated if not set) |
 
+**Automatic GPU Optimizations:**
+
+WaveDL automatically enables performance optimizations for modern GPUs:
+
+| Optimization | Effect | GPU Support |
+|--------------|--------|-------------|
+| **TF32 precision** | ~2x speedup for float32 matmul | A100, H100 (Ampere+) |
+| **cuDNN benchmark** | Auto-tuned convolutions | All NVIDIA GPUs |
+| **Worker scaling** | Up to 16 workers per GPU | All systems |
+
+> [!NOTE]
+> These optimizations are **backward compatible** — they have no effect on older GPUs (V100, T4, GTX) or CPU-only systems. No configuration needed.
+
+**HPC Best Practices:**
+- Stage data to `$SLURM_TMPDIR` (local NVMe) for maximum I/O throughput
+- Use `--compile` for training runs > 50 epochs
+- Increase `--workers` manually if auto-detection is suboptimal
+
 </details>
 
 <details>
@@ -690,7 +710,7 @@ python -m wavedl.hpo --data_path train.npz --models cnn --n_trials 50
 python -m wavedl.hpo --data_path train.npz --models cnn resnet18 resnet50 vit_small densenet121 --n_trials 200
 ```
 
-**
+**Train with best parameters**
 
 After HPO completes, it prints the optimal command:
 ```bash
@@ -837,7 +857,7 @@ import numpy as np
 X = np.random.randn(1000, 256, 256).astype(np.float32)
 y = np.random.randn(1000, 5).astype(np.float32)
 
-np.savez('test_data.npz',
+np.savez('test_data.npz', input_test=X, output_test=y)
 ```
 
 </details>
@@ -849,7 +869,7 @@ np.savez('test_data.npz', input_train=X, output_train=y)
 import numpy as np
 
 data = np.load('train_data.npz')
-assert data['input_train'].ndim
+assert data['input_train'].ndim >= 2, "Input must be at least 2D: (N, ...) "
 assert data['output_train'].ndim == 2, "Output must be 2D: (N, T)"
 assert len(data['input_train']) == len(data['output_train']), "Sample mismatch"
 
@@ -1004,7 +1024,7 @@ Beyond the material characterization example above, the WaveDL pipeline can be a
 | Resource | Description |
 |----------|-------------|
 | Technical Paper | In-depth framework description *(coming soon)* |
-| [`_template.py`](models/_template.py) | Template for
+| [`_template.py`](src/wavedl/models/_template.py) | Template for custom architectures |
 
 ---
 
````