@zigrivers/scaffold 3.13.0 → 3.15.0

Files changed (180)

package/README.md +32 -10
package/content/knowledge/research/research-architecture.md +385 -0
package/content/knowledge/research/research-conventions.md +248 -0
package/content/knowledge/research/research-dev-environment.md +303 -0
package/content/knowledge/research/research-experiment-loop.md +429 -0
package/content/knowledge/research/research-experiment-tracking.md +336 -0
package/content/knowledge/research/research-ml-architecture-search.md +383 -0
package/content/knowledge/research/research-ml-evaluation.md +407 -0
package/content/knowledge/research/research-ml-experiment-tracking.md +466 -0
package/content/knowledge/research/research-ml-training-patterns.md +413 -0
package/content/knowledge/research/research-observability.md +395 -0
package/content/knowledge/research/research-overfitting-prevention.md +306 -0
package/content/knowledge/research/research-project-structure.md +264 -0
package/content/knowledge/research/research-quant-backtesting.md +326 -0
package/content/knowledge/research/research-quant-market-data.md +366 -0
package/content/knowledge/research/research-quant-metrics.md +335 -0
package/content/knowledge/research/research-quant-requirements.md +223 -0
package/content/knowledge/research/research-quant-risk.md +469 -0
package/content/knowledge/research/research-quant-strategy-patterns.md +412 -0
package/content/knowledge/research/research-requirements.md +201 -0
package/content/knowledge/research/research-security.md +374 -0
package/content/knowledge/research/research-sim-compute-management.md +538 -0
package/content/knowledge/research/research-sim-engine-patterns.md +448 -0
package/content/knowledge/research/research-sim-parameter-spaces.md +425 -0
package/content/knowledge/research/research-sim-validation.md +456 -0
package/content/knowledge/research/research-testing.md +334 -0
package/content/methodology/research-ml-research.yml +23 -0
package/content/methodology/research-overlay.yml +65 -0
package/content/methodology/research-quant-finance.yml +29 -0
package/content/methodology/research-simulation.yml +23 -0
package/dist/cli/commands/adopt.d.ts.map +1 -1
package/dist/cli/commands/adopt.js +30 -8
package/dist/cli/commands/adopt.js.map +1 -1
package/dist/cli/commands/adopt.serialization.test.js +49 -0
package/dist/cli/commands/adopt.serialization.test.js.map +1 -1
package/dist/cli/commands/adopt.test.js +8 -0
package/dist/cli/commands/adopt.test.js.map +1 -1
package/dist/cli/commands/build.d.ts.map +1 -1
package/dist/cli/commands/build.js +191 -180
package/dist/cli/commands/build.js.map +1 -1
package/dist/cli/commands/complete.d.ts.map +1 -1
package/dist/cli/commands/complete.js +16 -12
package/dist/cli/commands/complete.js.map +1 -1
package/dist/cli/commands/complete.test.js +14 -5
package/dist/cli/commands/complete.test.js.map +1 -1
package/dist/cli/commands/init.d.ts +4 -0
package/dist/cli/commands/init.d.ts.map +1 -1
package/dist/cli/commands/init.js +75 -51
package/dist/cli/commands/init.js.map +1 -1
package/dist/cli/commands/init.test.js +33 -27
package/dist/cli/commands/init.test.js.map +1 -1
package/dist/cli/commands/reset.d.ts.map +1 -1
package/dist/cli/commands/reset.js +44 -40
package/dist/cli/commands/reset.js.map +1 -1
package/dist/cli/commands/reset.test.js +42 -20
package/dist/cli/commands/reset.test.js.map +1 -1
package/dist/cli/commands/rework.d.ts.map +1 -1
package/dist/cli/commands/rework.js +16 -12
package/dist/cli/commands/rework.js.map +1 -1
package/dist/cli/commands/rework.test.js +12 -3
package/dist/cli/commands/rework.test.js.map +1 -1
package/dist/cli/commands/run.d.ts.map +1 -1
package/dist/cli/commands/run.js +318 -298
package/dist/cli/commands/run.js.map +1 -1
package/dist/cli/commands/run.test.js +92 -120
package/dist/cli/commands/run.test.js.map +1 -1
package/dist/cli/commands/skip.d.ts.map +1 -1
package/dist/cli/commands/skip.js +19 -15
package/dist/cli/commands/skip.js.map +1 -1
package/dist/cli/commands/skip.test.js +22 -11
package/dist/cli/commands/skip.test.js.map +1 -1
package/dist/cli/commands/update.d.ts.map +1 -1
package/dist/cli/commands/update.js +3 -1
package/dist/cli/commands/update.js.map +1 -1
package/dist/cli/commands/update.test.js +8 -4
package/dist/cli/commands/update.test.js.map +1 -1
package/dist/cli/commands/version.d.ts.map +1 -1
package/dist/cli/commands/version.js +3 -1
package/dist/cli/commands/version.js.map +1 -1
package/dist/cli/commands/version.test.js +9 -5
package/dist/cli/commands/version.test.js.map +1 -1
package/dist/cli/index.d.ts.map +1 -1
package/dist/cli/index.js +2 -0
package/dist/cli/index.js.map +1 -1
package/dist/cli/init-flag-families.d.ts +6 -1
package/dist/cli/init-flag-families.d.ts.map +1 -1
package/dist/cli/init-flag-families.js +32 -1
package/dist/cli/init-flag-families.js.map +1 -1
package/dist/cli/init-flag-families.test.js +47 -0
package/dist/cli/init-flag-families.test.js.map +1 -1
package/dist/cli/output/interactive.d.ts +1 -0
package/dist/cli/output/interactive.d.ts.map +1 -1
package/dist/cli/output/interactive.js +5 -0
package/dist/cli/output/interactive.js.map +1 -1
package/dist/cli/shutdown.d.ts +51 -0
package/dist/cli/shutdown.d.ts.map +1 -0
package/dist/cli/shutdown.js +199 -0
package/dist/cli/shutdown.js.map +1 -0
package/dist/cli/shutdown.test.d.ts +2 -0
package/dist/cli/shutdown.test.d.ts.map +1 -0
package/dist/cli/shutdown.test.js +316 -0
package/dist/cli/shutdown.test.js.map +1 -0
package/dist/config/schema.d.ts +272 -16
package/dist/config/schema.d.ts.map +1 -1
package/dist/config/schema.js +25 -1
package/dist/config/schema.js.map +1 -1
package/dist/config/schema.test.js +103 -3
package/dist/config/schema.test.js.map +1 -1
package/dist/core/assembly/overlay-loader.d.ts +12 -0
package/dist/core/assembly/overlay-loader.d.ts.map +1 -1
package/dist/core/assembly/overlay-loader.js +30 -0
package/dist/core/assembly/overlay-loader.js.map +1 -1
package/dist/core/assembly/overlay-loader.test.js +66 -1
package/dist/core/assembly/overlay-loader.test.js.map +1 -1
package/dist/core/assembly/overlay-state-resolver.d.ts.map +1 -1
package/dist/core/assembly/overlay-state-resolver.js +48 -19
package/dist/core/assembly/overlay-state-resolver.js.map +1 -1
package/dist/core/assembly/overlay-state-resolver.test.js +80 -0
package/dist/core/assembly/overlay-state-resolver.test.js.map +1 -1
package/dist/e2e/init.test.js +5 -4
package/dist/e2e/init.test.js.map +1 -1
package/dist/e2e/project-type-overlays.test.js +119 -0
package/dist/e2e/project-type-overlays.test.js.map +1 -1
package/dist/project/adopt.d.ts.map +1 -1
package/dist/project/adopt.js +3 -1
package/dist/project/adopt.js.map +1 -1
package/dist/project/detectors/disambiguate.js +1 -1
package/dist/project/detectors/disambiguate.js.map +1 -1
package/dist/project/detectors/index.d.ts.map +1 -1
package/dist/project/detectors/index.js +2 -1
package/dist/project/detectors/index.js.map +1 -1
package/dist/project/detectors/ml.d.ts.map +1 -1
package/dist/project/detectors/ml.js +2 -6
package/dist/project/detectors/ml.js.map +1 -1
package/dist/project/detectors/research.d.ts +4 -0
package/dist/project/detectors/research.d.ts.map +1 -0
package/dist/project/detectors/research.js +141 -0
package/dist/project/detectors/research.js.map +1 -0
package/dist/project/detectors/research.test.d.ts +2 -0
package/dist/project/detectors/research.test.d.ts.map +1 -0
package/dist/project/detectors/research.test.js +235 -0
package/dist/project/detectors/research.test.js.map +1 -0
package/dist/project/detectors/shared-signals.d.ts +3 -0
package/dist/project/detectors/shared-signals.d.ts.map +1 -0
package/dist/project/detectors/shared-signals.js +9 -0
package/dist/project/detectors/shared-signals.js.map +1 -0
package/dist/project/detectors/types.d.ts +6 -2
package/dist/project/detectors/types.d.ts.map +1 -1
package/dist/project/detectors/types.js.map +1 -1
package/dist/state/lock-manager.d.ts +1 -0
package/dist/state/lock-manager.d.ts.map +1 -1
package/dist/state/lock-manager.js +1 -1
package/dist/state/lock-manager.js.map +1 -1
package/dist/types/config.d.ts +7 -1
package/dist/types/config.d.ts.map +1 -1
package/dist/wizard/copy/core.d.ts.map +1 -1
package/dist/wizard/copy/core.js +4 -0
package/dist/wizard/copy/core.js.map +1 -1
package/dist/wizard/copy/index.d.ts.map +1 -1
package/dist/wizard/copy/index.js +2 -0
package/dist/wizard/copy/index.js.map +1 -1
package/dist/wizard/copy/research.d.ts +3 -0
package/dist/wizard/copy/research.d.ts.map +1 -0
package/dist/wizard/copy/research.js +27 -0
package/dist/wizard/copy/research.js.map +1 -0
package/dist/wizard/copy/types.d.ts +5 -1
package/dist/wizard/copy/types.d.ts.map +1 -1
package/dist/wizard/flags.d.ts +7 -1
package/dist/wizard/flags.d.ts.map +1 -1
package/dist/wizard/questions.d.ts +4 -2
package/dist/wizard/questions.d.ts.map +1 -1
package/dist/wizard/questions.js +27 -1
package/dist/wizard/questions.js.map +1 -1
package/dist/wizard/questions.test.js +51 -0
package/dist/wizard/questions.test.js.map +1 -1
package/dist/wizard/wizard.d.ts +3 -2
package/dist/wizard/wizard.d.ts.map +1 -1
package/dist/wizard/wizard.js +3 -1
package/dist/wizard/wizard.js.map +1 -1
package/package.json +1 -1

package/content/knowledge/research/research-observability.md ADDED

@@ -0,0 +1,395 @@
+---
+name: research-observability
+description: Monitoring experiment loops including anomaly detection, resource tracking, progress dashboards, and alert thresholds for research projects
+topics: [research, observability, monitoring, anomaly-detection, resource-tracking, dashboards, alerts]
+---
+Autonomous experiment loops can run for hours or days without human attention. Without observability, a loop can waste compute on a converged metric, silently produce garbage after a data pipeline failure, or exhaust disk space without anyone noticing. Observability for research is not about uptime SLAs -- it is about experiment health: is the loop making progress, are the results valid, and are resources being consumed at a reasonable rate.
+## Summary
+Monitor three dimensions of experiment loop health: progress (is the primary metric improving, how much budget remains), validity (are results within expected ranges, are there anomalies in metric distributions), and resources (CPU, memory, disk, GPU utilization, and cost). Implement structured logging with metric history, anomaly detection on metric time series, and alerting for budget thresholds and stalled progress. Provide both real-time terminal output and persistent dashboards for async review.
+## Deep Guidance
+### Structured Logging
+Use structured logging (JSON lines) for all experiment output so it can be parsed programmatically:
+```python
+# src/observability/structured_log.py
+import logging
+import sys
+import structlog
+def configure_logging(log_path: str | None = None, level: str = "INFO"):
+    """Configure structured logging for the experiment loop."""
+    processors = [
+        structlog.stdlib.add_log_level,
+        # add_logger_name is omitted: it reads logger.name, which PrintLogger lacks
+        structlog.processors.TimeStamper(fmt="iso"),
+        structlog.processors.StackInfoRenderer(),
+        structlog.processors.format_exc_info,
+    ]
+    if log_path:
+        # JSON to file for machine parsing
+        processors.append(structlog.processors.JSONRenderer())
+    else:
+        # Human-readable to terminal
+        processors.append(structlog.dev.ConsoleRenderer())
+    structlog.configure(
+        processors=processors,
+        # Drop events below `level` before they reach the processors
+        wrapper_class=structlog.make_filtering_bound_logger(
+            getattr(logging, level.upper())
+        ),
+        logger_factory=structlog.PrintLoggerFactory(
+            # The file handle stays open for the lifetime of the process
+            file=open(log_path, "a") if log_path else sys.stderr
+        ),
+    )
+    return structlog.get_logger()
+# Usage in the experiment loop
+logger = configure_logging("results/exp-001/log.jsonl")
+logger.info("run_complete",
+            run_id="run-0042",
+            metrics={"sharpe": 1.45, "max_dd": 0.12},
+            decision="keep",
+            budget_remaining={"runs": 458, "time_hours": 36.2})
+```
+### Progress Monitoring
+Track experiment progress and detect stalls:
+```python
+# src/observability/progress.py
+import time
+from dataclasses import dataclass, field
+@dataclass
+class ProgressMonitor:
+    """Track experiment loop progress and detect stalls."""
+    total_budget: int = 500
+    start_time: float = field(default_factory=time.time)
+    metric_history: list[float] = field(default_factory=list)
+    best_value: float = float("-inf")
+    best_iteration: int = 0
+    stall_threshold: int = 50  # Iterations without improvement
+    def update(self, iteration: int, metric_value: float) -> dict:
+        """Update progress and return status report."""
+        self.metric_history.append(metric_value)
+        if metric_value > self.best_value:
+            self.best_value = metric_value
+            self.best_iteration = iteration
+        elapsed = time.time() - self.start_time
+        runs_per_hour = iteration / (elapsed / 3600) if elapsed > 0 else 0
+        remaining_runs = self.total_budget - iteration
+        eta_hours = remaining_runs / runs_per_hour if runs_per_hour > 0 else float("inf")
+        stalled = (iteration - self.best_iteration) >= self.stall_threshold
+        return {
+            "iteration": iteration,
+            "current_metric": metric_value,
+            "best_metric": self.best_value,
+            "best_at_iteration": self.best_iteration,
+            "runs_since_improvement": iteration - self.best_iteration,
+            "elapsed_hours": elapsed / 3600,
+            "runs_per_hour": runs_per_hour,
+            "eta_hours": eta_hours,
+            "budget_used_pct": (iteration / self.total_budget) * 100,
+            "stalled": stalled,
+        }
+    def format_status_line(self, status: dict) -> str:
+        """Format a one-line progress summary for terminal output."""
+        return (
+            f"[{status['iteration']}/{self.total_budget}] "
+            f"best={status['best_metric']:.4f} (iter {status['best_at_iteration']}) "
+            f"current={status['current_metric']:.4f} "
+            f"rate={status['runs_per_hour']:.1f}/hr "
+            f"ETA={status['eta_hours']:.1f}h "
+            f"{'STALLED' if status['stalled'] else 'ok'}"
+        )
+```
+### Anomaly Detection on Metrics
+Detect when metric values are outside expected ranges, which may indicate data issues or bugs:
+```python
+# src/observability/anomaly.py
+import numpy as np
+from typing import Optional
+class MetricAnomalyDetector:
+    """Detect anomalous metric values using statistical bounds."""
+    def __init__(self, warmup: int = 20, z_threshold: float = 3.0):
+        self.warmup = warmup
+        self.z_threshold = z_threshold
+        self.values: list[float] = []
+    def check(self, value: float) -> Optional[dict]:
+        """
+        Check if a metric value is anomalous.
+        Returns anomaly info dict if detected, None otherwise.
+        """
+        self.values.append(value)
+        if len(self.values) < self.warmup:
+            return None  # Not enough data for reliable detection
+        arr = np.array(self.values[:-1])  # Exclude current value
+        mean = arr.mean()
+        std = arr.std()
+        if std < 1e-10:  # All values identical (degenerate case)
+            return None
+        z_score = (value - mean) / std
+        if abs(z_score) > self.z_threshold:
+            return {
+                "value": value,
+                "mean": float(mean),
+                "std": float(std),
+                "z_score": float(z_score),
+                "direction": "high" if z_score > 0 else "low",
+                "message": (
+                    f"Anomalous metric: {value:.4f} "
+                    f"(z={z_score:.2f}, expected {mean:.4f} +/- {std:.4f})"
+                ),
+            }
+        return None
+class MultiMetricAnomalyDetector:
+    """Monitor multiple metrics simultaneously."""
+    def __init__(self, metric_names: list[str], **kwargs):
+        self.detectors = {name: MetricAnomalyDetector(**kwargs) for name in metric_names}
+    def check_all(self, metrics: dict[str, float]) -> list[dict]:
+        """Check all metrics and return any anomalies found."""
+        anomalies = []
+        for name, value in metrics.items():
+            if name in self.detectors:
+                anomaly = self.detectors[name].check(value)
+                if anomaly:
+                    anomaly["metric_name"] = name
+                    anomalies.append(anomaly)
+        return anomalies
+```
+### Resource Tracking
+Monitor compute resource consumption during experiment execution:
+```python
+# src/observability/resources.py
+import os
+import time
+from dataclasses import dataclass
+@dataclass
+class ResourceSnapshot:
+    """Point-in-time resource usage."""
+    timestamp: float
+    cpu_percent: float
+    memory_mb: float
+    disk_used_mb: float
+    gpu_memory_mb: float | None = None
+    gpu_utilization_pct: float | None = None
+def capture_resources(results_dir: str) -> ResourceSnapshot:
+    """Capture current resource usage."""
+    import psutil
+    process = psutil.Process(os.getpid())
+    # CPU and memory
+    cpu_pct = process.cpu_percent(interval=0.1)
+    mem_mb = process.memory_info().rss / (1024 * 1024)
+    # Disk usage of results directory
+    disk_mb = sum(
+        f.stat().st_size for f in _walk_files(results_dir)
+    ) / (1024 * 1024)
+    # GPU (optional)
+    gpu_mem = None
+    gpu_util = None
+    try:
+        import torch
+        if torch.cuda.is_available():
+            gpu_mem = torch.cuda.memory_allocated() / (1024 * 1024)
+            # Note: utilization requires nvidia-smi or pynvml
+    except ImportError:
+        pass
+    return ResourceSnapshot(
+        timestamp=time.time(),
+        cpu_percent=cpu_pct,
+        memory_mb=mem_mb,
+        disk_used_mb=disk_mb,
+        gpu_memory_mb=gpu_mem,
+        gpu_utilization_pct=gpu_util,
+    )
+def _walk_files(directory: str):
+    """Walk directory and yield all files."""
+    from pathlib import Path
+    for path in Path(directory).rglob("*"):
+        if path.is_file():
+            yield path
+class ResourceBudgetTracker:
+    """Track cumulative resource consumption against budget."""
+    def __init__(self, max_disk_mb: float = 10240, max_memory_mb: float = 8192):
+        self.max_disk_mb = max_disk_mb
+        self.max_memory_mb = max_memory_mb
+        self.history: list[ResourceSnapshot] = []
+    def record(self, snapshot: ResourceSnapshot) -> list[str]:
+        """Record a snapshot and return any budget warnings."""
+        self.history.append(snapshot)
+        warnings = []
+        if snapshot.disk_used_mb > self.max_disk_mb * 0.8:
+            warnings.append(
+                f"Disk usage at {snapshot.disk_used_mb:.0f}MB "
+                f"({snapshot.disk_used_mb / self.max_disk_mb * 100:.0f}% of budget)"
+            )
+        if snapshot.memory_mb > self.max_memory_mb * 0.9:
+            warnings.append(
+                f"Memory usage at {snapshot.memory_mb:.0f}MB "
+                f"({snapshot.memory_mb / self.max_memory_mb * 100:.0f}% of budget)"
+            )
+        return warnings
+```
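The snapshot code above leaves `gpu_utilization_pct` unset and notes that utilization needs nvidia-smi or pynvml. A minimal sketch of the pynvml route follows. The function name `read_gpu_utilization` is hypothetical, and the sketch assumes the NVML Python bindings are installed separately; it returns `None` on machines without the bindings or without an NVIDIA driver, so the caller can keep the field optional.

```python
from typing import Optional


def read_gpu_utilization(device_index: int = 0) -> Optional[dict]:
    """Return GPU utilization and memory use, or None if NVML is unavailable.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    try:
        import pynvml  # Assumption: NVML bindings installed separately
    except ImportError:
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # No NVIDIA driver or no GPU on this machine
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {
            # Percent of time the GPU was busy over the last sample period
            "gpu_utilization_pct": float(util.gpu),
            # Device-wide memory in use, not only this process
            "gpu_memory_mb": mem.used / (1024 * 1024),
        }
    except pynvml.NVMLError:
        return None
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    print(read_gpu_utilization())
```

Note that the memory figure here is device-wide, whereas `torch.cuda.memory_allocated()` reports only what the current process has allocated through PyTorch, so the two numbers are not interchangeable.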
+### Alert System
+Alert on conditions that require attention:
+```python
+# src/observability/alerts.py
+from dataclasses import dataclass
+from enum import Enum
+from typing import Callable
+class AlertSeverity(Enum):
+    INFO = "info"
+    WARNING = "warning"
+    CRITICAL = "critical"
+@dataclass
+class Alert:
+    severity: AlertSeverity
+    message: str
+    metric_name: str = ""
+    value: float = 0.0
+class AlertManager:
+    """Configurable alert system for experiment monitoring."""
+    def __init__(self):
+        # Each rule is (name, condition, severity, message_template)
+        self.rules: list[tuple[str, Callable[[dict], bool], AlertSeverity, str]] = []
+        self.fired: list[Alert] = []
+    def add_rule(self, name: str, condition: Callable[[dict], bool],
+                 severity: AlertSeverity, message_template: str) -> None:
+        self.rules.append((name, condition, severity, message_template))
+    def check(self, status: dict) -> list[Alert]:
+        """Evaluate all rules against current status."""
+        alerts = []
+        for name, condition, severity, msg_template in self.rules:
+            try:
+                if condition(status):
+                    alert = Alert(
+                        severity=severity,
+                        message=msg_template.format(**status),
+                    )
+                    alerts.append(alert)
+                    self.fired.append(alert)
+            except (KeyError, TypeError):
+                pass
+        return alerts
+# Default alert rules
+def default_alerts() -> AlertManager:
+    mgr = AlertManager()
+    mgr.add_rule(
+        "stall_warning",
+        lambda s: s.get("runs_since_improvement", 0) >= 30,
+        AlertSeverity.WARNING,
+        "Stalled: {runs_since_improvement} runs without improvement",
+    )
+    mgr.add_rule(
+        "budget_critical",
+        lambda s: s.get("budget_used_pct", 0) >= 90,
+        AlertSeverity.CRITICAL,
+        "Budget nearly exhausted: {budget_used_pct:.0f}% used",
+    )
+    mgr.add_rule(
+        "error_rate",
+        lambda s: s.get("consecutive_errors", 0) >= 5,
+        AlertSeverity.CRITICAL,
+        "High error rate: {consecutive_errors} consecutive failures",
+    )
+    return mgr
+```
+### Terminal Dashboard
+For real-time monitoring during autonomous execution:
+```python
+# src/observability/dashboard.py
+import sys
+def print_dashboard(status: dict, alerts: list, resource: dict) -> None:
+    """Print a compact terminal dashboard."""
+    # Clear and redraw on stdout, the same stream print() writes to
+    sys.stdout.write("\033[2J\033[H")  # Clear screen, cursor to top
+    print("=" * 60)
+    print(f"  Experiment: {status.get('experiment_id', 'unknown')}")
+    print(f"  Iteration:  {status['iteration']}/{status.get('total_budget', '?')}")
+    print(f"  Best:       {status['best_metric']:.6f} (iter {status['best_at_iteration']})")
+    print(f"  Current:    {status['current_metric']:.6f}")
+    print(f"  Rate:       {status['runs_per_hour']:.1f} runs/hr")
+    print(f"  ETA:        {status['eta_hours']:.1f} hours")
+    print("-" * 60)
+    print(f"  CPU: {resource.get('cpu_percent', 0):.0f}%  "
+          f"MEM: {resource.get('memory_mb', 0):.0f}MB  "
+          f"DISK: {resource.get('disk_used_mb', 0):.0f}MB")
+    if resource.get("gpu_memory_mb") is not None:
+        print(f"  GPU MEM: {resource['gpu_memory_mb']:.0f}MB")
+    print("-" * 60)
+    if alerts:
+        for alert in alerts:
+            prefix = "!!" if alert.severity.value == "critical" else "!"
+            print(f"  {prefix} {alert.message}")
+    else:
+        print("  No alerts")
+    print("=" * 60)
+```
+### Observability Best Practices
+1. **Log every iteration**: Even discarded runs produce valuable data about what does not work.
+2. **Structured over unstructured**: JSON lines, not free-form text. Machine-parseable logs enable automated analysis.
+3. **Separate experiment logs from system logs**: Experiment metrics go to the results directory. System health goes to stderr or a system log.
+4. **Alert on stalls, not just failures**: A loop that is not improving is wasting compute, even if it is not crashing.
+5. **Resource snapshots at regular intervals**: Capture every N iterations, not just at start and end. Memory leaks and disk growth are only visible over time.
+6. **Persistent dashboards for async review**: Write dashboard HTML to the results directory so reviewers can check progress without a live terminal session.
+7. **Cost tracking**: If running on cloud infrastructure, track estimated cost per run and alert when the cost budget is approaching its limit.
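Practice 6 can be sketched as a small writer that renders the status dict and alert messages to a static HTML file. This is only an illustration under assumptions: the function name `write_html_dashboard`, the file name `dashboard.html`, and the 60-second refresh interval are all hypothetical choices, not part of the package.

```python
import html
from pathlib import Path


def write_html_dashboard(status: dict, alerts: list[str], results_dir: str) -> Path:
    """Render status and alerts to <results_dir>/dashboard.html (hypothetical name)."""
    # Escape every key and value so metric names or messages cannot break the markup
    rows = "\n".join(
        f"<tr><th>{html.escape(str(key))}</th><td>{html.escape(str(value))}</td></tr>"
        for key, value in status.items()
    )
    alert_items = (
        "\n".join(f"<li>{html.escape(message)}</li>" for message in alerts)
        or "<li>No alerts</li>"
    )
    page = f"""<!DOCTYPE html>
<html><head><meta charset="utf-8">
<meta http-equiv="refresh" content="60">
<title>Experiment dashboard</title></head>
<body>
<h1>Experiment status</h1>
<table>
{rows}
</table>
<h2>Alerts</h2>
<ul>
{alert_items}
</ul>
</body></html>
"""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "dashboard.html"
    tmp_path = out_dir / "dashboard.html.tmp"
    # Write to a temp file and rename, so a reviewer never opens a half-written page
    tmp_path.write_text(page, encoding="utf-8")
    tmp_path.replace(out_path)
    return out_path


if __name__ == "__main__":
    path = write_html_dashboard(
        {"iteration": 42, "best_metric": 1.45, "stalled": False},
        ["Stalled: 30 runs without improvement"],
        "results/exp-001",
    )
    print(f"Dashboard written to {path}")
```

Calling this every N iterations alongside the terminal dashboard keeps the file current; the write-then-rename step is there so that a browser refresh during a write still sees the previous complete page.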

package/content/knowledge/research/research-overfitting-prevention.md ADDED

@@ -0,0 +1,306 @@
+---
+name: research-overfitting-prevention
+description: Out-of-sample validation, cross-validation strategies, statistical significance testing, and when to stop iterating to prevent overfitting
+topics: [research, overfitting, validation, cross-validation, statistical-significance, out-of-sample]
+---
+Overfitting is the central risk of iterative research. Every time an agent evaluates a hypothesis against data and uses the result to guide the next hypothesis, it is implicitly fitting to that data. After hundreds of iterations, even random strategies will appear to perform well on the evaluation set -- this is multiple comparisons bias. Preventing overfitting requires rigorous separation of training and evaluation data, statistical significance testing, and disciplined stopping criteria.
+## Summary
+Split data into train, validation, and holdout sets. Use the validation set for iteration decisions (keep/discard) and reserve the holdout set for final evaluation only -- never let the holdout set influence any iteration decision. Apply cross-validation for small datasets. Use statistical significance tests (permutation tests, bootstrap confidence intervals) to verify that results are real, not noise. Stop iterating when the improvement per iteration falls below the noise floor.
+## Deep Guidance
+### Data Splitting for Research
+The data split for research projects has three levels, not two:
+```
+┌─────────────────────────────────────────────────┐
+│                 Full Dataset                     │
+├──────────────────┬────────────┬─────────────────┤
+│   Training Set   │ Validation │   Holdout Set   │
+│   (60-70%)       │  (15-20%)  │   (15-20%)      │
+├──────────────────┼────────────┼─────────────────┤
+│ Strategy learns  │ Keep/discard│ Final eval ONLY │
+│ from this data   │ decisions   │ Touch once      │
+└──────────────────┴────────────┴─────────────────┘
+```
+**Critical rule**: The holdout set is touched exactly once -- at the very end of the research project, to report final results. If the holdout set is used to make any iteration decision, it becomes a validation set and loses its value.
+```python
+# src/data/splitter.py
+import numpy as np
+from dataclasses import dataclass
+from typing import Any
+@dataclass
+class DataSplit:
+    """Three-way data split for research projects."""
+    train: Any
+    validation: Any
+    holdout: Any
+def temporal_split(data: np.ndarray, train_frac: float = 0.6,
+                   val_frac: float = 0.2) -> DataSplit:
+    """
+    Temporal split for time-series data.
+    MUST be chronological -- never shuffle time-series data.
+    """
+    n = len(data)
+    train_end = int(n * train_frac)
+    val_end = int(n * (train_frac + val_frac))
+    return DataSplit(
+        train=data[:train_end],
+        validation=data[train_end:val_end],
+        holdout=data[val_end:],
+    )
+def random_split(data: np.ndarray, train_frac: float = 0.6,
+                 val_frac: float = 0.2, seed: int = 42) -> DataSplit:
+    """
+    Random split for non-temporal data.
+    Use when data points are independent (no time ordering).
+    """
+    rng = np.random.default_rng(seed)
+    indices = rng.permutation(len(data))
+    n = len(data)
+    train_end = int(n * train_frac)
+    val_end = int(n * (train_frac + val_frac))
+    return DataSplit(
+        train=data[indices[:train_end]],
+        validation=data[indices[train_end:val_end]],
+        holdout=data[indices[val_end:]],
+    )
+```
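The "touched exactly once" rule is easy to state and easy to break by accident. One way to make it mechanical is to wrap the holdout data in a guard that hands it out a single time. The sketch below is an assumption-laden illustration: `HoldoutGuard` and `HoldoutAlreadyUsedError` are hypothetical names, and the guard only protects within one process, so a restarted loop would need the flag persisted to disk.

```python
from typing import Any


class HoldoutAlreadyUsedError(RuntimeError):
    """Raised when the holdout set is requested a second time."""


class HoldoutGuard:
    """Hand out the holdout set exactly once (hypothetical helper)."""

    def __init__(self, holdout: Any):
        self._holdout = holdout
        self._used = False
        self.reason: str | None = None

    @property
    def used(self) -> bool:
        return self._used

    def release(self, reason: str) -> Any:
        """Return the holdout data and record why it was released."""
        if self._used:
            raise HoldoutAlreadyUsedError(
                f"Holdout already released for: {self.reason!r}"
            )
        self._used = True
        self.reason = reason
        return self._holdout


if __name__ == "__main__":
    guard = HoldoutGuard(holdout=[0.01, -0.02, 0.03])
    final_data = guard.release("final evaluation of best strategy")
    try:
        guard.release("one more quick check")
    except HoldoutAlreadyUsedError as exc:
        print(f"Blocked: {exc}")
```

Passing the guard, rather than the raw array, into the experiment loop means iteration code has no direct handle on the holdout data at all.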
+### Walk-Forward Validation (Time Series)
+For time-series research (trading strategies, forecasting), use walk-forward validation instead of random cross-validation:
+```python
+# src/evaluation/walk_forward.py
+import numpy as np
+from dataclasses import dataclass
+@dataclass
+class WalkForwardWindow:
+    train_start: int
+    train_end: int
+    test_start: int
+    test_end: int
+def walk_forward_splits(n_samples: int, train_window: int,
+                        test_window: int, step: int | None = None
+                        ) -> list[WalkForwardWindow]:
+    """
+    Generate walk-forward validation windows.
+    Produces rolling train/test splits that move forward in time:
+      [train_0][test_0]
+         [train_1][test_1]
+            [train_2][test_2]
+    """
+    if step is None:
+        step = test_window
+    windows = []
+    start = 0
+    while start + train_window + test_window <= n_samples:
+        windows.append(WalkForwardWindow(
+            train_start=start,
+            train_end=start + train_window,
+            test_start=start + train_window,
+            test_end=start + train_window + test_window,
+        ))
+        start += step
+    return windows
+def walk_forward_evaluate(strategy, data, train_window: int = 252,
+                          test_window: int = 63) -> list[dict]:
+    """
+    Evaluate a strategy using walk-forward analysis.
+    Returns metrics for each window.
+    """
+    windows = walk_forward_splits(len(data), train_window, test_window)
+    results = []
+    for w in windows:
+        train_data = data[w.train_start:w.train_end]
+        test_data = data[w.test_start:w.test_end]
+        strategy.fit(train_data)
+        metrics = strategy.evaluate(test_data)
+        results.append({
+            "window": f"{w.test_start}-{w.test_end}",
+            **metrics,
+        })
+    return results
+```
+### Cross-Validation for Small Datasets
+When the dataset is too small to set aside a dedicated validation set, use k-fold cross-validation on the train+validation portion, keeping the holdout untouched:
+```python
+# src/evaluation/cross_validation.py
+import numpy as np
+def stratified_kfold_evaluate(strategy_factory, data, labels,
+                               k: int = 5, seed: int = 42) -> dict:
+    """
+    K-fold cross-validation with stratified splits.
+    Returns mean and std of metrics across folds.
+    """
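+    rng = np.random.default_rng(seed)
+    labels = np.asarray(labels)
+    # Assign fold ids class by class so each fold keeps the label proportions
+    # (assumes discrete class labels)
+    fold_of = np.empty(len(labels), dtype=int)
+    for cls in np.unique(labels):
+        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
+        fold_of[cls_idx] = np.arange(len(cls_idx)) % k
+    all_metrics = []
+    for i in range(k):
+        test_idx = np.flatnonzero(fold_of == i)
+        train_idx = np.flatnonzero(fold_of != i)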
+        strategy = strategy_factory()  # Fresh instance per fold
+        strategy.fit(data[train_idx], labels[train_idx])
+        metrics = strategy.evaluate(data[test_idx], labels[test_idx])
+        all_metrics.append(metrics)
+    # Aggregate across folds
+    metric_names = all_metrics[0].keys()
+    return {
+        name: {
+            "mean": np.mean([m[name] for m in all_metrics]),
+            "std": np.std([m[name] for m in all_metrics]),
+            "per_fold": [m[name] for m in all_metrics],
+        }
+        for name in metric_names
+    }
+```
+### Statistical Significance Testing
+After hundreds of iterations, a strategy that appears to beat the baseline may be a statistical artifact. Test significance before accepting:
+```python
+# src/evaluation/statistical.py
+import numpy as np
+def permutation_test(strategy_returns: np.ndarray, baseline_returns: np.ndarray,
+                     n_permutations: int = 10000, seed: int = 42) -> dict:
+    """
+    One-sided permutation test for difference in mean returns.
+    Tests H0: strategy and baseline come from the same distribution,
+    against the alternative that the strategy's mean return is higher.
+    """
+    rng = np.random.default_rng(seed)
+    observed_diff = strategy_returns.mean() - baseline_returns.mean()
+    combined = np.concatenate([strategy_returns, baseline_returns])
+    n_strategy = len(strategy_returns)
+    count_extreme = 0
+    for _ in range(n_permutations):
+        perm = rng.permutation(combined)
+        perm_diff = perm[:n_strategy].mean() - perm[n_strategy:].mean()
+        if perm_diff >= observed_diff:
+            count_extreme += 1
+    p_value = (count_extreme + 1) / (n_permutations + 1)
+    return {
+        "observed_difference": float(observed_diff),
+        "p_value": float(p_value),
+        "significant_at_005": p_value < 0.05,
+        "significant_at_001": p_value < 0.01,
+        "n_permutations": n_permutations,
+    }
+def bootstrap_confidence_interval(values: np.ndarray, statistic=np.mean,
+                                   confidence: float = 0.95,
+                                   n_bootstrap: int = 10000,
+                                   seed: int = 42) -> dict:
+    """
+    Bootstrap confidence interval for a statistic.
+    Use to estimate uncertainty on experiment metrics.
+    """
+    rng = np.random.default_rng(seed)
+    bootstrap_stats = []
+    for _ in range(n_bootstrap):
+        sample = rng.choice(values, size=len(values), replace=True)
+        bootstrap_stats.append(statistic(sample))
+    bootstrap_stats = np.array(bootstrap_stats)
+    alpha = (1 - confidence) / 2
+    lower = np.percentile(bootstrap_stats, 100 * alpha)
+    upper = np.percentile(bootstrap_stats, 100 * (1 - alpha))
+    return {
+        "point_estimate": float(statistic(values)),
+        "lower": float(lower),
+        "upper": float(upper),
+        "confidence": confidence,
+    }
+```
+### Multiple Comparisons Correction
+When testing many hypotheses, the probability of at least one false positive increases. Correct for this:
+```python
+import numpy as np
+def bonferroni_threshold(base_alpha: float, n_comparisons: int) -> float:
+    """
+    Bonferroni correction: divide alpha by number of comparisons.
+    Conservative but simple.
+    """
+    return base_alpha / n_comparisons
+def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
+    """
+    Holm-Bonferroni step-down procedure.
+    Less conservative than Bonferroni while controlling family-wise error.
+    """
+    n = len(p_values)
+    sorted_indices = np.argsort(p_values)
+    sorted_pvals = np.array(p_values)[sorted_indices]
+    significant = [False] * n
+    for i, (idx, pval) in enumerate(zip(sorted_indices, sorted_pvals)):
+        adjusted_alpha = alpha / (n - i)
+        if pval <= adjusted_alpha:
+            significant[idx] = True
+        else:
+            break  # Stop at first non-rejection
+    return significant
+```
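To see how quickly the problem grows, the family-wise error rate for m tests at level alpha is 1 - (1 - alpha)^m when the tests are independent. The short sketch below prints that rate with and without the Bonferroni threshold. The independence assumption is a simplification: hypotheses in an iterative loop are usually correlated, so treat the numbers as a rough guide. The function name is hypothetical.

```python
def family_wise_error_rate(alpha: float, n_comparisons: int) -> float:
    """P(at least one false positive) across n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


if __name__ == "__main__":
    base_alpha = 0.05
    for m in (1, 10, 100, 500):
        uncorrected = family_wise_error_rate(base_alpha, m)
        # Bonferroni: test each hypothesis at alpha / m instead of alpha
        corrected = family_wise_error_rate(base_alpha / m, m)
        print(f"m={m:>3}  uncorrected={uncorrected:.3f}  bonferroni={corrected:.3f}")
```

With the per-test threshold divided by m, the family-wise rate stays at or below the original alpha regardless of how many hypotheses the loop has tried.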
+### When to Stop Iterating
+Practical decision framework:
+| Signal | Action | Example |
+|--------|--------|---------|
+| Primary metric met target | Stop, run holdout eval | Sharpe > 1.5 on validation |
+| Convergence detected | Stop, run holdout eval | Mean Sharpe unchanged for 50 runs |
+| Budget exhausted | Stop, report best result | 500 runs completed |
+| No improvement is statistically significant | Stop, report negative result | p > 0.05 for every candidate improvement |
+| Validation improving but train degrading | Investigate -- possible bug | Opposite curves on train/val |
+| Holdout result much worse than validation | Report overfitting, do not deploy | Sharpe 1.5 val, 0.3 holdout |
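The convergence row can be made concrete by comparing recent gains against a measured noise floor. The sketch below is one possible reading, not a prescribed rule: it assumes higher metric values are better, estimates the noise floor as the sample standard deviation across repeated runs of one configuration with different seeds, and uses a hypothetical 50-run window. The names `estimate_noise_std` and `should_stop` are illustrative.

```python
import statistics


def estimate_noise_std(repeat_values: list[float]) -> float:
    """Noise floor: sample std of the metric across repeats of one configuration."""
    return statistics.stdev(repeat_values)


def should_stop(metric_history: list[float], noise_std: float,
                window: int = 50, noise_multiple: float = 1.0) -> bool:
    """Return True when the best metric gained less than the noise floor
    over the last `window` runs. Assumes higher is better."""
    if len(metric_history) <= window:
        return False  # Not enough history to compare two periods
    best_before = max(metric_history[:-window])
    best_recent = max(metric_history[-window:])
    gain = best_recent - best_before
    return gain < noise_multiple * noise_std


if __name__ == "__main__":
    # Same configuration run with three seeds gives a rough noise estimate
    noise = estimate_noise_std([1.40, 1.45, 1.38])
    history = [0.5 + 0.05 * i for i in range(20)] + [1.45] * 50
    print(f"noise_std={noise:.4f} stop={should_stop(history, noise)}")
```

Raising `noise_multiple` makes the rule stop earlier; lowering it keeps the loop running on smaller gains, at the cost of chasing differences that may be noise.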
+### Overfitting Red Flags
+Watch for these warning signs during iteration:
+1. **Validation metric much better than cross-validation mean**: The specific validation split may be easy. Use CV to get a robust estimate.
+2. **Improvement from many small parameters**: Complex models with many tuned parameters are more likely to overfit than simple models.
+3. **Results sensitive to data ordering**: If reshuffling which samples land in the validation set changes the result significantly, the sample size is too small.
+4. **Monotonically improving metrics across iterations**: Real research has noise. If every iteration is better than the last, something is leaking.
+5. **Results do not replicate across time periods**: A strategy that works on 2020-2022 but fails on 2023 is likely overfit to the training period.
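Red flags 1 and 4 lend themselves to automatic checks. The sketch below shows one way to code them; the two-standard-deviation margin and the ten-run window are arbitrary illustrative thresholds, and both function names are hypothetical.

```python
import statistics


def validation_exceeds_cv(validation_metric: float, cv_fold_metrics: list[float],
                          n_std: float = 2.0) -> bool:
    """Flag 1: validation score sits well above the cross-validation mean."""
    cv_mean = statistics.mean(cv_fold_metrics)
    cv_std = statistics.stdev(cv_fold_metrics)
    return validation_metric > cv_mean + n_std * cv_std


def suspiciously_monotonic(metric_history: list[float], min_runs: int = 10) -> bool:
    """Flag 4: every one of the last `min_runs` runs beat the run before it."""
    recent = metric_history[-min_runs:]
    if len(recent) < min_runs:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    folds = [1.0, 1.1, 0.9, 1.0, 1.0]
    print("validation vs CV:", validation_exceeds_cv(1.5, folds))
    print("monotonic:", suspiciously_monotonic([0.1 * i for i in range(12)]))
```

A positive result from either check is a prompt to investigate for leakage or an easy split, not proof of overfitting.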