@zigrivers/scaffold 3.13.0 → 3.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (180)
  1. package/README.md +32 -10
  2. package/content/knowledge/research/research-architecture.md +385 -0
  3. package/content/knowledge/research/research-conventions.md +248 -0
  4. package/content/knowledge/research/research-dev-environment.md +303 -0
  5. package/content/knowledge/research/research-experiment-loop.md +429 -0
  6. package/content/knowledge/research/research-experiment-tracking.md +336 -0
  7. package/content/knowledge/research/research-ml-architecture-search.md +383 -0
  8. package/content/knowledge/research/research-ml-evaluation.md +407 -0
  9. package/content/knowledge/research/research-ml-experiment-tracking.md +466 -0
  10. package/content/knowledge/research/research-ml-training-patterns.md +413 -0
  11. package/content/knowledge/research/research-observability.md +395 -0
  12. package/content/knowledge/research/research-overfitting-prevention.md +306 -0
  13. package/content/knowledge/research/research-project-structure.md +264 -0
  14. package/content/knowledge/research/research-quant-backtesting.md +326 -0
  15. package/content/knowledge/research/research-quant-market-data.md +366 -0
  16. package/content/knowledge/research/research-quant-metrics.md +335 -0
  17. package/content/knowledge/research/research-quant-requirements.md +223 -0
  18. package/content/knowledge/research/research-quant-risk.md +469 -0
  19. package/content/knowledge/research/research-quant-strategy-patterns.md +412 -0
  20. package/content/knowledge/research/research-requirements.md +201 -0
  21. package/content/knowledge/research/research-security.md +374 -0
  22. package/content/knowledge/research/research-sim-compute-management.md +538 -0
  23. package/content/knowledge/research/research-sim-engine-patterns.md +448 -0
  24. package/content/knowledge/research/research-sim-parameter-spaces.md +425 -0
  25. package/content/knowledge/research/research-sim-validation.md +456 -0
  26. package/content/knowledge/research/research-testing.md +334 -0
  27. package/content/methodology/research-ml-research.yml +23 -0
  28. package/content/methodology/research-overlay.yml +65 -0
  29. package/content/methodology/research-quant-finance.yml +29 -0
  30. package/content/methodology/research-simulation.yml +23 -0
  31. package/dist/cli/commands/adopt.d.ts.map +1 -1
  32. package/dist/cli/commands/adopt.js +30 -8
  33. package/dist/cli/commands/adopt.js.map +1 -1
  34. package/dist/cli/commands/adopt.serialization.test.js +49 -0
  35. package/dist/cli/commands/adopt.serialization.test.js.map +1 -1
  36. package/dist/cli/commands/adopt.test.js +8 -0
  37. package/dist/cli/commands/adopt.test.js.map +1 -1
  38. package/dist/cli/commands/build.d.ts.map +1 -1
  39. package/dist/cli/commands/build.js +191 -180
  40. package/dist/cli/commands/build.js.map +1 -1
  41. package/dist/cli/commands/complete.d.ts.map +1 -1
  42. package/dist/cli/commands/complete.js +16 -12
  43. package/dist/cli/commands/complete.js.map +1 -1
  44. package/dist/cli/commands/complete.test.js +14 -5
  45. package/dist/cli/commands/complete.test.js.map +1 -1
  46. package/dist/cli/commands/init.d.ts +4 -0
  47. package/dist/cli/commands/init.d.ts.map +1 -1
  48. package/dist/cli/commands/init.js +75 -51
  49. package/dist/cli/commands/init.js.map +1 -1
  50. package/dist/cli/commands/init.test.js +33 -27
  51. package/dist/cli/commands/init.test.js.map +1 -1
  52. package/dist/cli/commands/reset.d.ts.map +1 -1
  53. package/dist/cli/commands/reset.js +44 -40
  54. package/dist/cli/commands/reset.js.map +1 -1
  55. package/dist/cli/commands/reset.test.js +42 -20
  56. package/dist/cli/commands/reset.test.js.map +1 -1
  57. package/dist/cli/commands/rework.d.ts.map +1 -1
  58. package/dist/cli/commands/rework.js +16 -12
  59. package/dist/cli/commands/rework.js.map +1 -1
  60. package/dist/cli/commands/rework.test.js +12 -3
  61. package/dist/cli/commands/rework.test.js.map +1 -1
  62. package/dist/cli/commands/run.d.ts.map +1 -1
  63. package/dist/cli/commands/run.js +318 -298
  64. package/dist/cli/commands/run.js.map +1 -1
  65. package/dist/cli/commands/run.test.js +92 -120
  66. package/dist/cli/commands/run.test.js.map +1 -1
  67. package/dist/cli/commands/skip.d.ts.map +1 -1
  68. package/dist/cli/commands/skip.js +19 -15
  69. package/dist/cli/commands/skip.js.map +1 -1
  70. package/dist/cli/commands/skip.test.js +22 -11
  71. package/dist/cli/commands/skip.test.js.map +1 -1
  72. package/dist/cli/commands/update.d.ts.map +1 -1
  73. package/dist/cli/commands/update.js +3 -1
  74. package/dist/cli/commands/update.js.map +1 -1
  75. package/dist/cli/commands/update.test.js +8 -4
  76. package/dist/cli/commands/update.test.js.map +1 -1
  77. package/dist/cli/commands/version.d.ts.map +1 -1
  78. package/dist/cli/commands/version.js +3 -1
  79. package/dist/cli/commands/version.js.map +1 -1
  80. package/dist/cli/commands/version.test.js +9 -5
  81. package/dist/cli/commands/version.test.js.map +1 -1
  82. package/dist/cli/index.d.ts.map +1 -1
  83. package/dist/cli/index.js +2 -0
  84. package/dist/cli/index.js.map +1 -1
  85. package/dist/cli/init-flag-families.d.ts +6 -1
  86. package/dist/cli/init-flag-families.d.ts.map +1 -1
  87. package/dist/cli/init-flag-families.js +32 -1
  88. package/dist/cli/init-flag-families.js.map +1 -1
  89. package/dist/cli/init-flag-families.test.js +47 -0
  90. package/dist/cli/init-flag-families.test.js.map +1 -1
  91. package/dist/cli/output/interactive.d.ts +1 -0
  92. package/dist/cli/output/interactive.d.ts.map +1 -1
  93. package/dist/cli/output/interactive.js +5 -0
  94. package/dist/cli/output/interactive.js.map +1 -1
  95. package/dist/cli/shutdown.d.ts +51 -0
  96. package/dist/cli/shutdown.d.ts.map +1 -0
  97. package/dist/cli/shutdown.js +199 -0
  98. package/dist/cli/shutdown.js.map +1 -0
  99. package/dist/cli/shutdown.test.d.ts +2 -0
  100. package/dist/cli/shutdown.test.d.ts.map +1 -0
  101. package/dist/cli/shutdown.test.js +316 -0
  102. package/dist/cli/shutdown.test.js.map +1 -0
  103. package/dist/config/schema.d.ts +272 -16
  104. package/dist/config/schema.d.ts.map +1 -1
  105. package/dist/config/schema.js +25 -1
  106. package/dist/config/schema.js.map +1 -1
  107. package/dist/config/schema.test.js +103 -3
  108. package/dist/config/schema.test.js.map +1 -1
  109. package/dist/core/assembly/overlay-loader.d.ts +12 -0
  110. package/dist/core/assembly/overlay-loader.d.ts.map +1 -1
  111. package/dist/core/assembly/overlay-loader.js +30 -0
  112. package/dist/core/assembly/overlay-loader.js.map +1 -1
  113. package/dist/core/assembly/overlay-loader.test.js +66 -1
  114. package/dist/core/assembly/overlay-loader.test.js.map +1 -1
  115. package/dist/core/assembly/overlay-state-resolver.d.ts.map +1 -1
  116. package/dist/core/assembly/overlay-state-resolver.js +48 -19
  117. package/dist/core/assembly/overlay-state-resolver.js.map +1 -1
  118. package/dist/core/assembly/overlay-state-resolver.test.js +80 -0
  119. package/dist/core/assembly/overlay-state-resolver.test.js.map +1 -1
  120. package/dist/e2e/init.test.js +5 -4
  121. package/dist/e2e/init.test.js.map +1 -1
  122. package/dist/e2e/project-type-overlays.test.js +119 -0
  123. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  124. package/dist/project/adopt.d.ts.map +1 -1
  125. package/dist/project/adopt.js +3 -1
  126. package/dist/project/adopt.js.map +1 -1
  127. package/dist/project/detectors/disambiguate.js +1 -1
  128. package/dist/project/detectors/disambiguate.js.map +1 -1
  129. package/dist/project/detectors/index.d.ts.map +1 -1
  130. package/dist/project/detectors/index.js +2 -1
  131. package/dist/project/detectors/index.js.map +1 -1
  132. package/dist/project/detectors/ml.d.ts.map +1 -1
  133. package/dist/project/detectors/ml.js +2 -6
  134. package/dist/project/detectors/ml.js.map +1 -1
  135. package/dist/project/detectors/research.d.ts +4 -0
  136. package/dist/project/detectors/research.d.ts.map +1 -0
  137. package/dist/project/detectors/research.js +141 -0
  138. package/dist/project/detectors/research.js.map +1 -0
  139. package/dist/project/detectors/research.test.d.ts +2 -0
  140. package/dist/project/detectors/research.test.d.ts.map +1 -0
  141. package/dist/project/detectors/research.test.js +235 -0
  142. package/dist/project/detectors/research.test.js.map +1 -0
  143. package/dist/project/detectors/shared-signals.d.ts +3 -0
  144. package/dist/project/detectors/shared-signals.d.ts.map +1 -0
  145. package/dist/project/detectors/shared-signals.js +9 -0
  146. package/dist/project/detectors/shared-signals.js.map +1 -0
  147. package/dist/project/detectors/types.d.ts +6 -2
  148. package/dist/project/detectors/types.d.ts.map +1 -1
  149. package/dist/project/detectors/types.js.map +1 -1
  150. package/dist/state/lock-manager.d.ts +1 -0
  151. package/dist/state/lock-manager.d.ts.map +1 -1
  152. package/dist/state/lock-manager.js +1 -1
  153. package/dist/state/lock-manager.js.map +1 -1
  154. package/dist/types/config.d.ts +7 -1
  155. package/dist/types/config.d.ts.map +1 -1
  156. package/dist/wizard/copy/core.d.ts.map +1 -1
  157. package/dist/wizard/copy/core.js +4 -0
  158. package/dist/wizard/copy/core.js.map +1 -1
  159. package/dist/wizard/copy/index.d.ts.map +1 -1
  160. package/dist/wizard/copy/index.js +2 -0
  161. package/dist/wizard/copy/index.js.map +1 -1
  162. package/dist/wizard/copy/research.d.ts +3 -0
  163. package/dist/wizard/copy/research.d.ts.map +1 -0
  164. package/dist/wizard/copy/research.js +27 -0
  165. package/dist/wizard/copy/research.js.map +1 -0
  166. package/dist/wizard/copy/types.d.ts +5 -1
  167. package/dist/wizard/copy/types.d.ts.map +1 -1
  168. package/dist/wizard/flags.d.ts +7 -1
  169. package/dist/wizard/flags.d.ts.map +1 -1
  170. package/dist/wizard/questions.d.ts +4 -2
  171. package/dist/wizard/questions.d.ts.map +1 -1
  172. package/dist/wizard/questions.js +27 -1
  173. package/dist/wizard/questions.js.map +1 -1
  174. package/dist/wizard/questions.test.js +51 -0
  175. package/dist/wizard/questions.test.js.map +1 -1
  176. package/dist/wizard/wizard.d.ts +3 -2
  177. package/dist/wizard/wizard.d.ts.map +1 -1
  178. package/dist/wizard/wizard.js +3 -1
  179. package/dist/wizard/wizard.js.map +1 -1
  180. package/package.json +1 -1
@@ -0,0 +1,395 @@
1
+ ---
2
+ name: research-observability
3
+ description: Monitoring experiment loops including anomaly detection, resource tracking, progress dashboards, and alert thresholds for research projects
4
+ topics: [research, observability, monitoring, anomaly-detection, resource-tracking, dashboards, alerts]
5
+ ---
6
+
7
+ Autonomous experiment loops can run for hours or days without human attention. Without observability, a loop can waste compute on a converged metric, silently produce garbage after a data pipeline failure, or exhaust disk space without anyone noticing. Observability for research is not about uptime SLAs -- it is about experiment health: is the loop making progress, are the results valid, and are resources being consumed at a reasonable rate.
8
+
9
+ ## Summary
10
+
11
+ Monitor three dimensions of experiment loop health: progress (is the primary metric improving, how much budget remains), validity (are results within expected ranges, are there anomalies in metric distributions), and resources (CPU, memory, disk, GPU utilization, and cost). Implement structured logging with metric history, anomaly detection on metric time series, and alerting for budget thresholds and stalled progress. Provide both real-time terminal output and persistent dashboards for async review.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Structured Logging
16
+
17
+ Use structured logging (JSON lines) for all experiment output so it can be parsed programmatically:
18
+
19
+ ```python
20
+ # src/observability/structured_log.py
21
+ import structlog
22
+ import sys
23
+
24
+ def configure_logging(log_path: str | None = None, level: str = "INFO"):
25
+ """Configure structured logging for the experiment loop."""
26
+ processors = [
27
+ structlog.stdlib.add_log_level,
28
+ # add_logger_name omitted: it reads logger.name, which PrintLogger does not have
29
+ structlog.processors.TimeStamper(fmt="iso"),
30
+ structlog.processors.StackInfoRenderer(),
31
+ structlog.processors.format_exc_info,
32
+ ]
33
+
34
+ if log_path:
35
+ # JSON to file for machine parsing
36
+ processors.append(structlog.processors.JSONRenderer())
37
+ else:
38
+ # Human-readable to terminal
39
+ processors.append(structlog.dev.ConsoleRenderer())
40
+
41
+ structlog.configure(
42
+ processors=processors,
43
+ wrapper_class=structlog.stdlib.BoundLogger,
44
+ logger_factory=structlog.PrintLoggerFactory(
45
+ file=open(log_path, "a") if log_path else sys.stderr
46
+ ),
47
+ )
48
+
49
+ return structlog.get_logger()
50
+
51
+ # Usage in the experiment loop
52
+ logger = configure_logging("results/exp-001/log.jsonl")
53
+ logger.info("run_complete",
54
+ run_id="run-0042",
55
+ metrics={"sharpe": 1.45, "max_dd": 0.12},
56
+ decision="keep",
57
+ budget_remaining={"runs": 458, "time_hours": 36.2})
58
+ ```
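Because each record is one JSON object per line, the log can be read back with the standard library alone. The following is a minimal sketch, assuming records shaped like the `run_complete` event above (an `event` key plus a `metrics` dict). The helper names `iter_events` and `best_run` are hypothetical and not part of the module.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_events(lines: Iterable[str], event: str) -> Iterator[dict]:
    """Yield parsed records whose "event" field matches, skipping malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written last line after a crash
        if isinstance(record, dict) and record.get("event") == event:
            yield record


def best_run(lines: Iterable[str], metric: str) -> Optional[dict]:
    """Return the run_complete record with the highest value of `metric`."""
    runs = [
        r for r in iter_events(lines, "run_complete")
        if isinstance(r.get("metrics"), dict) and metric in r["metrics"]
    ]
    return max(runs, key=lambda r: r["metrics"][metric], default=None)


# In-memory sample standing in for the lines of a log.jsonl file.
sample = [
    '{"event": "run_complete", "run_id": "run-0041", "metrics": {"sharpe": 1.10}}',
    '{"event": "run_complete", "run_id": "run-0042", "metrics": {"sharpe": 1.45}}',
    '{"event": "run_compl',  # truncated line, skipped by the parser
]
print(best_run(sample, "sharpe")["run_id"])
```

An open file object can be passed in place of `sample`, since iterating a text file yields one line at a time.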
59
+
60
+ ### Progress Monitoring
61
+
62
+ Track experiment progress and detect stalls:
63
+
64
+ ```python
65
+ # src/observability/progress.py
66
+ import time
67
+ from dataclasses import dataclass, field
68
+
69
+ @dataclass
70
+ class ProgressMonitor:
71
+ """Track experiment loop progress and detect stalls."""
72
+ total_budget: int = 500
73
+ start_time: float = field(default_factory=time.time)
74
+ metric_history: list[float] = field(default_factory=list)
75
+ best_value: float = float("-inf") # Assumes a higher metric value is better
76
+ best_iteration: int = 0
77
+ stall_threshold: int = 50 # Iterations without improvement
78
+
79
+ def update(self, iteration: int, metric_value: float) -> dict:
80
+ """Update progress and return status report."""
81
+ self.metric_history.append(metric_value)
82
+
83
+ if metric_value > self.best_value:
84
+ self.best_value = metric_value
85
+ self.best_iteration = iteration
86
+
87
+ elapsed = time.time() - self.start_time
88
+ runs_per_hour = iteration / (elapsed / 3600) if elapsed > 0 else 0
89
+ remaining_runs = self.total_budget - iteration
90
+ eta_hours = remaining_runs / runs_per_hour if runs_per_hour > 0 else float("inf")
91
+
92
+ stalled = (iteration - self.best_iteration) >= self.stall_threshold
93
+
94
+ return {
95
+ "iteration": iteration,
96
+ "current_metric": metric_value,
97
+ "best_metric": self.best_value,
98
+ "best_at_iteration": self.best_iteration,
99
+ "runs_since_improvement": iteration - self.best_iteration,
100
+ "elapsed_hours": elapsed / 3600,
101
+ "runs_per_hour": runs_per_hour,
102
+ "eta_hours": eta_hours,
103
+ "budget_used_pct": (iteration / self.total_budget) * 100,
104
+ "stalled": stalled,
105
+ }
106
+
107
+ def format_status_line(self, status: dict) -> str:
108
+ """Format a one-line progress summary for terminal output."""
109
+ return (
110
+ f"[{status['iteration']}/{self.total_budget}] "
111
+ f"best={status['best_metric']:.4f} (iter {status['best_at_iteration']}) "
112
+ f"current={status['current_metric']:.4f} "
113
+ f"rate={status['runs_per_hour']:.1f}/hr "
114
+ f"ETA={status['eta_hours']:.1f}h "
115
+ f"{'STALLED' if status['stalled'] else 'ok'}"
116
+ )
117
+ ```
118
+
119
+ ### Anomaly Detection on Metrics
120
+
121
+ Detect when metric values are outside expected ranges, which may indicate data issues or bugs:
122
+
123
+ ```python
124
+ # src/observability/anomaly.py
125
+ import numpy as np
126
+ from typing import Optional
127
+
128
+ class MetricAnomalyDetector:
129
+ """Detect anomalous metric values using statistical bounds."""
130
+
131
+ def __init__(self, warmup: int = 20, z_threshold: float = 3.0):
132
+ self.warmup = warmup
133
+ self.z_threshold = z_threshold
134
+ self.values: list[float] = []
135
+
136
+ def check(self, value: float) -> Optional[dict]:
137
+ """
138
+ Check if a metric value is anomalous.
139
+ Returns anomaly info dict if detected, None otherwise.
140
+ """
141
+ self.values.append(value)
142
+
143
+ if len(self.values) < self.warmup:
144
+ return None # Not enough data for reliable detection
145
+
146
+ arr = np.array(self.values[:-1]) # Exclude current value
147
+ mean = arr.mean()
148
+ std = arr.std()
149
+
150
+ if std < 1e-10: # All values identical (degenerate case)
151
+ return None
152
+
153
+ z_score = (value - mean) / std
154
+
155
+ if abs(z_score) > self.z_threshold:
156
+ return {
157
+ "value": value,
158
+ "mean": float(mean),
159
+ "std": float(std),
160
+ "z_score": float(z_score),
161
+ "direction": "high" if z_score > 0 else "low",
162
+ "message": (
163
+ f"Anomalous metric: {value:.4f} "
164
+ f"(z={z_score:.2f}, expected {mean:.4f} +/- {std:.4f})"
165
+ ),
166
+ }
167
+
168
+ return None
169
+
170
+ class MultiMetricAnomalyDetector:
171
+ """Monitor multiple metrics simultaneously."""
172
+
173
+ def __init__(self, metric_names: list[str], **kwargs):
174
+ self.detectors = {name: MetricAnomalyDetector(**kwargs) for name in metric_names}
175
+
176
+ def check_all(self, metrics: dict[str, float]) -> list[dict]:
177
+ """Check all metrics and return any anomalies found."""
178
+ anomalies = []
179
+ for name, value in metrics.items():
180
+ if name in self.detectors:
181
+ anomaly = self.detectors[name].check(value)
182
+ if anomaly:
183
+ anomaly["metric_name"] = name
184
+ anomalies.append(anomaly)
185
+ return anomalies
186
+ ```
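One limitation of the mean and standard-deviation bounds above is that the detector appends every value to its history, including anomalous ones, so a single large outlier inflates `std` and can mask later anomalies. A possible variant, not part of the original module, swaps in the median and the median absolute deviation (MAD). The class name `RobustAnomalyDetector`, the default threshold of 3.5, and the 1.4826 scaling (which makes MAD comparable to a standard deviation for normally distributed data) are assumptions of this sketch.

```python
import numpy as np


class RobustAnomalyDetector:
    """Flag values far from the median, measured in scaled MAD units."""

    def __init__(self, warmup: int = 20, z_threshold: float = 3.5):
        self.warmup = warmup
        self.z_threshold = z_threshold
        self.values: list[float] = []

    def check(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to earlier values."""
        history = np.array(self.values)  # History excludes the current value
        self.values.append(value)

        if len(history) < self.warmup:
            return False  # Not enough data for reliable detection

        median = np.median(history)
        mad = np.median(np.abs(history - median))
        if mad < 1e-10:  # More than half the history is identical
            return False

        robust_z = (value - median) / (1.4826 * mad)
        return bool(abs(robust_z) > self.z_threshold)


detector = RobustAnomalyDetector(warmup=5)
series = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 50.0, 1.02]
flags = [detector.check(v) for v in series]
print(flags)
```

In this series the outlier 50.0 is flagged, and the ordinary value that follows it is not, because one extreme point barely moves the median or the MAD.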
187
+
188
+ ### Resource Tracking
189
+
190
+ Monitor compute resource consumption during experiment execution:
191
+
192
+ ```python
193
+ # src/observability/resources.py
194
+ import os
195
+ import time
196
+ from dataclasses import dataclass
197
+
198
+ @dataclass
199
+ class ResourceSnapshot:
200
+ """Point-in-time resource usage."""
201
+ timestamp: float
202
+ cpu_percent: float
203
+ memory_mb: float
204
+ disk_used_mb: float
205
+ gpu_memory_mb: float | None = None
206
+ gpu_utilization_pct: float | None = None
207
+
208
+ def capture_resources(results_dir: str) -> ResourceSnapshot:
209
+ """Capture current resource usage."""
210
+ import psutil
211
+
212
+ process = psutil.Process(os.getpid())
213
+
214
+ # CPU and memory
215
+ cpu_pct = process.cpu_percent(interval=0.1)
216
+ mem_mb = process.memory_info().rss / (1024 * 1024)
217
+
218
+ # Disk usage of results directory
219
+ disk_mb = sum(
220
+ f.stat().st_size for f in _walk_files(results_dir)
221
+ ) / (1024 * 1024)
222
+
223
+ # GPU (optional)
224
+ gpu_mem = None
225
+ gpu_util = None
226
+ try:
227
+ import torch
228
+ if torch.cuda.is_available():
229
+ gpu_mem = torch.cuda.memory_allocated() / (1024 * 1024)
230
+ # Note: utilization requires nvidia-smi or pynvml
231
+ except ImportError:
232
+ pass
233
+
234
+ return ResourceSnapshot(
235
+ timestamp=time.time(),
236
+ cpu_percent=cpu_pct,
237
+ memory_mb=mem_mb,
238
+ disk_used_mb=disk_mb,
239
+ gpu_memory_mb=gpu_mem,
240
+ gpu_utilization_pct=gpu_util,
241
+ )
242
+
243
+ def _walk_files(directory: str):
244
+ """Walk directory and yield all files."""
245
+ from pathlib import Path
246
+ for path in Path(directory).rglob("*"):
247
+ if path.is_file():
248
+ yield path
249
+
250
+ class ResourceBudgetTracker:
251
+ """Track cumulative resource consumption against budget."""
252
+
253
+ def __init__(self, max_disk_mb: float = 10240, max_memory_mb: float = 8192):
254
+ self.max_disk_mb = max_disk_mb
255
+ self.max_memory_mb = max_memory_mb
256
+ self.history: list[ResourceSnapshot] = []
257
+
258
+ def record(self, snapshot: ResourceSnapshot) -> list[str]:
259
+ """Record a snapshot and return any budget warnings."""
260
+ self.history.append(snapshot)
261
+ warnings = []
262
+
263
+ if snapshot.disk_used_mb > self.max_disk_mb * 0.8:
264
+ warnings.append(
265
+ f"Disk usage at {snapshot.disk_used_mb:.0f}MB "
266
+ f"({snapshot.disk_used_mb / self.max_disk_mb * 100:.0f}% of budget)"
267
+ )
268
+ if snapshot.memory_mb > self.max_memory_mb * 0.9:
269
+ warnings.append(
270
+ f"Memory usage at {snapshot.memory_mb:.0f}MB "
271
+ f"({snapshot.memory_mb / self.max_memory_mb * 100:.0f}% of budget)"
272
+ )
273
+
274
+ return warnings
275
+ ```
276
+
277
+ ### Alert System
278
+
279
+ Alert on conditions that require attention:
280
+
281
+ ```python
282
+ # src/observability/alerts.py
283
+ from dataclasses import dataclass
284
+ from enum import Enum
285
+ from typing import Callable
286
+
287
+ class AlertSeverity(Enum):
288
+ INFO = "info"
289
+ WARNING = "warning"
290
+ CRITICAL = "critical"
291
+
292
+ @dataclass
293
+ class Alert:
294
+ severity: AlertSeverity
295
+ message: str
296
+ metric_name: str = ""
297
+ value: float = 0.0
298
+
299
+ class AlertManager:
300
+ """Configurable alert system for experiment monitoring."""
301
+
302
+ def __init__(self):
303
+ self.rules: list[tuple[str, Callable[[dict], bool], AlertSeverity, str]] = []
304
+ self.fired: list[Alert] = []
305
+
306
+ def add_rule(self, name: str, condition: Callable[[dict], bool],
307
+ severity: AlertSeverity, message_template: str) -> None:
308
+ self.rules.append((name, condition, severity, message_template))
309
+
310
+ def check(self, status: dict) -> list[Alert]:
311
+ """Evaluate all rules against current status."""
312
+ alerts = []
313
+ for name, condition, severity, msg_template in self.rules:
314
+ try:
315
+ if condition(status):
316
+ alert = Alert(
317
+ severity=severity,
318
+ message=msg_template.format(**status),
319
+ )
320
+ alerts.append(alert)
321
+ self.fired.append(alert)
322
+ except (KeyError, TypeError):
323
+ pass # Skip rules whose condition or template needs keys missing from status
324
+ return alerts
325
+
326
+ # Default alert rules
327
+ def default_alerts() -> AlertManager:
328
+ mgr = AlertManager()
329
+ mgr.add_rule(
330
+ "stall_warning",
331
+ lambda s: s.get("runs_since_improvement", 0) >= 30,
332
+ AlertSeverity.WARNING,
333
+ "Stalled: {runs_since_improvement} runs without improvement",
334
+ )
335
+ mgr.add_rule(
336
+ "budget_critical",
337
+ lambda s: s.get("budget_used_pct", 0) >= 90,
338
+ AlertSeverity.CRITICAL,
339
+ "Budget nearly exhausted: {budget_used_pct:.0f}% used",
340
+ )
341
+ mgr.add_rule(
342
+ "error_rate",
343
+ lambda s: s.get("consecutive_errors", 0) >= 5,
344
+ AlertSeverity.CRITICAL,
345
+ "High error rate: {consecutive_errors} consecutive failures",
346
+ )
347
+ return mgr
348
+ ```
349
+
350
+ ### Terminal Dashboard
351
+
352
+ For real-time monitoring during autonomous execution:
353
+
354
+ ```python
355
+ # src/observability/dashboard.py
356
+ import sys
357
+
358
+ def print_dashboard(status: dict, alerts: list, resource: dict) -> None:
359
+ """Print a compact terminal dashboard; `resource` is a dict, e.g. dataclasses.asdict(snapshot)."""
360
+ # Clear and redraw
361
+ sys.stdout.write("\033[2J\033[H") # Clear screen, cursor to top (same stream as print)
362
+
363
+ print("=" * 60)
364
+ print(f" Experiment: {status.get('experiment_id', 'unknown')}")
365
+ print(f" Iteration: {status['iteration']}/{status.get('total_budget', '?')}")
366
+ print(f" Best: {status['best_metric']:.6f} (iter {status['best_at_iteration']})")
367
+ print(f" Current: {status['current_metric']:.6f}")
368
+ print(f" Rate: {status['runs_per_hour']:.1f} runs/hr")
369
+ print(f" ETA: {status['eta_hours']:.1f} hours")
370
+ print("-" * 60)
371
+ print(f" CPU: {resource.get('cpu_percent', 0):.0f}% "
372
+ f"MEM: {resource.get('memory_mb', 0):.0f}MB "
373
+ f"DISK: {resource.get('disk_used_mb', 0):.0f}MB")
374
+ if resource.get("gpu_memory_mb") is not None:
375
+ print(f" GPU MEM: {resource['gpu_memory_mb']:.0f}MB")
376
+ print("-" * 60)
377
+
378
+ if alerts:
379
+ for alert in alerts:
380
+ prefix = "!!" if alert.severity.value == "critical" else "!"
381
+ print(f" {prefix} {alert.message}")
382
+ else:
383
+ print(" No alerts")
384
+ print("=" * 60)
385
+ ```
386
+
387
+ ### Observability Best Practices
388
+
389
+ 1. **Log every iteration**: Even discarded runs produce valuable data about what does not work.
390
+ 2. **Structured over unstructured**: JSON lines, not free-form text. Machine-parseable logs enable automated analysis.
391
+ 3. **Separate experiment logs from system logs**: Experiment metrics go to the results directory. System health goes to stderr or a system log.
392
+ 4. **Alert on stalls, not just failures**: A loop that is not improving is wasting compute, even if it is not crashing.
393
+ 5. **Resource snapshots at regular intervals**: Capture every N iterations, not just at start and end. Memory leaks and disk growth are only visible over time.
394
+ 6. **Persistent dashboards for async review**: Write dashboard HTML to the results directory so reviewers can check progress without a live terminal session.
395
+ 7. **Cost tracking**: If running on cloud infrastructure, track estimated cost per run and alert when the cost budget is approaching its limit.
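Item 7 can be sketched with a small accumulator. The hourly rate, the budget, and the 80% warning fraction below are placeholder assumptions rather than values from any cloud provider, and `CostTracker` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostTracker:
    """Accumulate estimated spend per run and warn near the budget limit."""
    hourly_rate_usd: float       # Assumed flat price of the machine per hour
    budget_usd: float            # Total cost budget for the experiment
    warn_fraction: float = 0.8   # Warn once this share of the budget is used
    spent_usd: float = 0.0

    def record_run(self, duration_seconds: float) -> Optional[str]:
        """Add the cost of one run; return a warning string or None."""
        self.spent_usd += self.hourly_rate_usd * duration_seconds / 3600
        used = self.spent_usd / self.budget_usd
        if used >= self.warn_fraction:
            return (
                f"Estimated cost ${self.spent_usd:.2f} is "
                f"{used * 100:.0f}% of the ${self.budget_usd:.2f} budget"
            )
        return None


tracker = CostTracker(hourly_rate_usd=3.0, budget_usd=10.0)
first = tracker.record_run(duration_seconds=3600)   # 30% of budget, no warning
second = tracker.record_run(duration_seconds=7200)  # 90% of budget, warning
print(first, second)
```

The returned string could be passed to the alert rules described earlier, for example by placing the used fraction in the status dict.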
@@ -0,0 +1,306 @@
1
+ ---
2
+ name: research-overfitting-prevention
3
+ description: Out-of-sample validation, cross-validation strategies, statistical significance testing, and when to stop iterating to prevent overfitting
4
+ topics: [research, overfitting, validation, cross-validation, statistical-significance, out-of-sample]
5
+ ---
6
+
7
+ Overfitting is the central risk of iterative research. Every time an agent evaluates a hypothesis against data and uses the result to guide the next hypothesis, it is implicitly fitting to that data. After hundreds of iterations, even random strategies will appear to perform well on the evaluation set -- this is multiple comparisons bias. Preventing overfitting requires rigorous separation of training and evaluation data, statistical significance testing, and disciplined stopping criteria.
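The effect is easy to reproduce. The following sketch, with the number of strategies and days chosen arbitrarily for illustration, draws returns for strategies that have zero true edge and reports the best-looking one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_strategies, n_days = 500, 252

# Daily returns with zero mean: no strategy has any real edge.
returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))

# Annualized Sharpe ratio per strategy (risk-free rate assumed to be 0).
sharpe = returns.mean(axis=1) / returns.std(axis=1, ddof=1) * np.sqrt(252)

best = int(np.argmax(sharpe))
print(f"median Sharpe: {np.median(sharpe):.2f}")
print(f"best Sharpe:   {sharpe[best]:.2f} (strategy {best})")
```

Under these assumptions each strategy's annualized Sharpe ratio is roughly standard normal, so the median sits near zero while the best of 500 is typically well above 2. Selecting that strategy because of its score is exactly the selection step that the validation and holdout discipline below is meant to catch.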
8
+
9
+ ## Summary
10
+
11
+ Split data into train, validation, and holdout sets. Use the validation set for iteration decisions (keep/discard) and reserve the holdout set for final evaluation only -- never let the holdout set influence any iteration decision. Apply cross-validation for small datasets. Use statistical significance tests (permutation tests, bootstrap confidence intervals) to verify that results are real, not noise. Stop iterating when the improvement per iteration falls below the noise floor.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Data Splitting for Research
16
+
17
+ The data split for research projects has three levels, not two:
18
+
19
+ ```
20
+ ┌─────────────────────────────────────────────────┐
21
+ │ Full Dataset │
22
+ ├──────────────────┬────────────┬─────────────────┤
23
+ │ Training Set │ Validation │ Holdout Set │
24
+ │ (60-70%) │ (15-20%) │ (15-20%) │
25
+ ├──────────────────┼────────────┼─────────────────┤
26
+ │ Strategy learns │ Keep/discard│ Final eval ONLY │
27
+ │ from this data │ decisions │ Touch once │
28
+ └──────────────────┴────────────┴─────────────────┘
29
+ ```
30
+
31
+ **Critical rule**: The holdout set is touched exactly once -- at the very end of the research project, to report final results. If the holdout set is used to make any iteration decision, it becomes a validation set and loses its value.
32
+
33
+ ```python
34
+ # src/data/splitter.py
35
+ import numpy as np
36
+ from dataclasses import dataclass
37
+ from typing import Any
38
+
39
+ @dataclass
40
+ class DataSplit:
41
+ """Three-way data split for research projects."""
42
+ train: Any
43
+ validation: Any
44
+ holdout: Any
45
+
46
+ def temporal_split(data: np.ndarray, train_frac: float = 0.6,
47
+ val_frac: float = 0.2) -> DataSplit:
48
+ """
49
+ Temporal split for time-series data.
50
+ MUST be chronological -- never shuffle time-series data.
51
+ """
52
+ n = len(data)
53
+ train_end = int(n * train_frac)
54
+ val_end = int(n * (train_frac + val_frac))
55
+
56
+ return DataSplit(
57
+ train=data[:train_end],
58
+ validation=data[train_end:val_end],
59
+ holdout=data[val_end:],
60
+ )
61
+
62
+ def random_split(data: np.ndarray, train_frac: float = 0.6,
63
+ val_frac: float = 0.2, seed: int = 42) -> DataSplit:
64
+ """
65
+ Random split for non-temporal data.
66
+ Use when data points are independent (no time ordering).
67
+ """
68
+ rng = np.random.default_rng(seed)
69
+ indices = rng.permutation(len(data))
70
+ n = len(data)
71
+ train_end = int(n * train_frac)
72
+ val_end = int(n * (train_frac + val_frac))
73
+
74
+ return DataSplit(
75
+ train=data[indices[:train_end]],
76
+ validation=data[indices[train_end:val_end]],
77
+ holdout=data[indices[val_end:]],
78
+ )
79
+ ```
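The touch-once rule can also be enforced in code rather than by convention. This is a minimal sketch, assuming a hypothetical `HoldoutGuard` wrapper that is not part of the splitter module above. It only guards access within one process; a multi-session project would need to persist the flag, for example in the results directory.

```python
from typing import Any


class HoldoutAlreadyUsedError(RuntimeError):
    """Raised when the holdout set is requested a second time."""


class HoldoutGuard:
    """Hand out the holdout set exactly once and record why."""

    def __init__(self, holdout: Any):
        self._holdout = holdout
        self._used = False
        self.reason = ""

    def release(self, reason: str) -> Any:
        """Return the holdout data on the first call; raise on any later call."""
        if self._used:
            raise HoldoutAlreadyUsedError(
                f"Holdout already released for: {self.reason!r}"
            )
        self._used = True
        self.reason = reason
        return self._holdout


guard = HoldoutGuard(holdout=[0.01, -0.02, 0.03])
final_data = guard.release(reason="final report")

try:
    guard.release(reason="one more check")
    second_access_blocked = False
except HoldoutAlreadyUsedError:
    second_access_blocked = True

print(second_access_blocked)
```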
80
+
81
+ ### Walk-Forward Validation (Time Series)
82
+
83
+ For time-series research (trading strategies, forecasting), use walk-forward validation instead of random cross-validation:
84
+
85
+ ```python
86
+ # src/evaluation/walk_forward.py
87
+ import numpy as np
88
+ from dataclasses import dataclass
89
+
90
+ @dataclass
91
+ class WalkForwardWindow:
92
+ train_start: int
93
+ train_end: int
94
+ test_start: int
95
+ test_end: int
96
+
97
+ def walk_forward_splits(n_samples: int, train_window: int,
98
+ test_window: int, step: int | None = None
99
+ ) -> list[WalkForwardWindow]:
100
+ """
101
+ Generate walk-forward validation windows.
102
+
103
+ Produces rolling train/test splits that move forward in time:
104
+ [train_0][test_0]
105
+ [train_1][test_1]
106
+ [train_2][test_2]
107
+ """
108
+ if step is None:
109
+ step = test_window
110
+
111
+ windows = []
112
+ start = 0
113
+ while start + train_window + test_window <= n_samples:
114
+ windows.append(WalkForwardWindow(
115
+ train_start=start,
116
+ train_end=start + train_window,
117
+ test_start=start + train_window,
118
+ test_end=start + train_window + test_window,
119
+ ))
120
+ start += step
121
+
122
+ return windows
123
+
124
+ def walk_forward_evaluate(strategy, data, train_window: int = 252,
125
+ test_window: int = 63) -> list[dict]:
126
+ """
127
+ Evaluate a strategy using walk-forward analysis.
128
+ Returns metrics for each window.
129
+ """
130
+ windows = walk_forward_splits(len(data), train_window, test_window)
131
+ results = []
132
+ for w in windows:
133
+ train_data = data[w.train_start:w.train_end]
134
+ test_data = data[w.test_start:w.test_end]
135
+
136
+ strategy.fit(train_data)
137
+ metrics = strategy.evaluate(test_data)
138
+ results.append({
139
+ "window": f"{w.test_start}-{w.test_end}",
140
+ **metrics,
141
+ })
142
+
143
+ return results
144
+ ```
145
+
146
+ ### Cross-Validation for Small Datasets
147
+
148
+ When the dataset is too small to spare a fixed validation set, use k-fold cross-validation on the train+validation portion, keeping the holdout untouched:
149
+
150
+ ```python
151
+ # src/evaluation/cross_validation.py
152
+ import numpy as np
153
+
154
+ def stratified_kfold_evaluate(strategy_factory, data, labels,
155
+ k: int = 5, seed: int = 42) -> dict:
156
+ """
157
+ K-fold cross-validation over shuffled indices (folds are not stratified by label, despite the name).
158
+ Returns mean and std of metrics across folds.
159
+ """
160
+ rng = np.random.default_rng(seed)
161
+ indices = rng.permutation(len(data))
162
+ fold_size = len(data) // k
163
+
164
+ all_metrics = []
165
+ for i in range(k):
166
+ test_idx = indices[i * fold_size:(i + 1) * fold_size]
167
+ train_idx = np.concatenate([
168
+ indices[:i * fold_size],
169
+ indices[(i + 1) * fold_size:],
170
+ ])
171
+
172
+ strategy = strategy_factory() # Fresh instance per fold
173
+ strategy.fit(data[train_idx], labels[train_idx])
174
+ metrics = strategy.evaluate(data[test_idx], labels[test_idx])
175
+ all_metrics.append(metrics)
176
+
177
+ # Aggregate across folds
178
+ metric_names = all_metrics[0].keys()
179
+ return {
180
+ name: {
181
+ "mean": np.mean([m[name] for m in all_metrics]),
182
+ "std": np.std([m[name] for m in all_metrics]),
183
+ "per_fold": [m[name] for m in all_metrics],
184
+ }
185
+ for name in metric_names
186
+ }
187
+ ```
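The function above shuffles indices without looking at `labels`, so class proportions can differ between folds. If per-fold class balance matters, fold membership can be assigned per class. This is a sketch using only NumPy; `stratified_fold_ids` is a hypothetical helper, and scikit-learn's `StratifiedKFold` is the more complete option if that dependency is acceptable.

```python
import numpy as np


def stratified_fold_ids(labels: np.ndarray, k: int = 5, seed: int = 42) -> np.ndarray:
    """Return a fold id in [0, k) per sample, balancing each class across folds."""
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        # Shuffle the positions of this class, then deal them round-robin
        # into folds so every fold gets a near-equal share of the class.
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_ids[cls_idx] = np.arange(len(cls_idx)) % k
    return fold_ids


labels = np.array([0] * 80 + [1] * 20)
fold_ids = stratified_fold_ids(labels, k=5)

for fold in range(5):
    test_mask = fold_ids == fold
    # Columns: fold id, test-fold size, number of positive labels in the fold.
    print(fold, int(test_mask.sum()), int(labels[test_mask].sum()))
```

With 80 negatives and 20 positives, each of the five folds holds 20 samples, 4 of them positive. Inside the k-fold loop, `fold_ids == i` would select the test samples and `fold_ids != i` the training samples.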
+
+ ### Statistical Significance Testing
+
+ After hundreds of iterations, a strategy that appears to beat the baseline may be a statistical artifact. Test for significance before accepting it:
+
+ ```python
+ # src/evaluation/statistical.py
+ import numpy as np
+
+ def permutation_test(strategy_returns: np.ndarray, baseline_returns: np.ndarray,
+                      n_permutations: int = 10000, seed: int = 42) -> dict:
+     """
+     Permutation test for difference in mean returns.
+     Tests H0: strategy and baseline come from the same distribution.
+     One-sided: the alternative is that the strategy's mean return is higher.
+     """
+     rng = np.random.default_rng(seed)
+     observed_diff = strategy_returns.mean() - baseline_returns.mean()
+
+     combined = np.concatenate([strategy_returns, baseline_returns])
+     n_strategy = len(strategy_returns)
+
+     count_extreme = 0
+     for _ in range(n_permutations):
+         perm = rng.permutation(combined)
+         perm_diff = perm[:n_strategy].mean() - perm[n_strategy:].mean()
+         if perm_diff >= observed_diff:
+             count_extreme += 1
+
+     # Add-one correction keeps the estimated p-value strictly above zero
+     p_value = (count_extreme + 1) / (n_permutations + 1)
+
+     return {
+         "observed_difference": float(observed_diff),
+         "p_value": float(p_value),
+         "significant_at_005": p_value < 0.05,
+         "significant_at_001": p_value < 0.01,
+         "n_permutations": n_permutations,
+     }
+
+ def bootstrap_confidence_interval(values: np.ndarray, statistic=np.mean,
+                                   confidence: float = 0.95,
+                                   n_bootstrap: int = 10000,
+                                   seed: int = 42) -> dict:
+     """
+     Bootstrap confidence interval for a statistic.
+     Use to estimate uncertainty on experiment metrics.
+     """
+     rng = np.random.default_rng(seed)
+     bootstrap_stats = []
+     for _ in range(n_bootstrap):
+         sample = rng.choice(values, size=len(values), replace=True)
+         bootstrap_stats.append(statistic(sample))
+
+     # Percentile interval: cut (1 - confidence) / 2 from each tail
+     bootstrap_stats = np.array(bootstrap_stats)
+     alpha = (1 - confidence) / 2
+     lower = np.percentile(bootstrap_stats, 100 * alpha)
+     upper = np.percentile(bootstrap_stats, 100 * (1 - alpha))
+
+     return {
+         "point_estimate": float(statistic(values)),
+         "lower": float(lower),
+         "upper": float(upper),
+         "confidence": confidence,
+     }
+ ```
+
+ ### Multiple Comparisons Correction
+
+ When testing many hypotheses, the probability of at least one false positive (the family-wise error rate) grows with the number of tests. Correct for this:
+
+ ```python
+ import numpy as np
+
+ def bonferroni_threshold(base_alpha: float, n_comparisons: int) -> float:
+     """
+     Bonferroni correction: divide alpha by number of comparisons.
+     Conservative but simple.
+     """
+     return base_alpha / n_comparisons
+
+ def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
+     """
+     Holm-Bonferroni step-down procedure.
+     Less conservative than Bonferroni while controlling family-wise error.
+     """
+     n = len(p_values)
+     sorted_indices = np.argsort(p_values)
+     sorted_pvals = np.array(p_values)[sorted_indices]
+
+     # Compare the i-th smallest p-value against alpha / (n - i)
+     significant = [False] * n
+     for i, (idx, pval) in enumerate(zip(sorted_indices, sorted_pvals)):
+         adjusted_alpha = alpha / (n - i)
+         if pval <= adjusted_alpha:
+             significant[idx] = True
+         else:
+             break  # Stop at first non-rejection
+
+     return significant
+ ```
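To make the growth in false positives concrete, the sketch below computes the family-wise error rate under the simplifying assumption that the tests are independent, which is rarely exactly true for strategy variants evaluated on the same data. The helper name `family_wise_error_rate` is illustrative and not part of the package.

```python
def family_wise_error_rate(alpha: float, n_comparisons: int) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


if __name__ == "__main__":
    for m in (1, 10, 50, 500):
        uncorrected = family_wise_error_rate(0.05, m)
        # Same tests, each run at the Bonferroni threshold alpha / m
        corrected = family_wise_error_rate(0.05 / m, m)
        print(f"m={m:>3}  uncorrected={uncorrected:.3f}  bonferroni={corrected:.3f}")
```

Under this assumption the uncorrected rate climbs quickly with the number of comparisons, while the Bonferroni-corrected rate stays at or below the base alpha.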
+
+ ### When to Stop Iterating
+
+ A practical decision framework:
+
+ | Signal | Action | Example |
+ |--------|--------|---------|
+ | Primary metric met target | Stop, run holdout eval | Sharpe > 1.5 on validation |
+ | Convergence detected | Stop, run holdout eval | Mean Sharpe unchanged for 50 runs |
+ | Budget exhausted | Stop, report best result | 500 runs completed |
+ | No improvement is significant | Stop, report negative result | p > 0.05 for all improvements |
+ | Validation improving but train degrading | Investigate -- possible bug | Opposite curves on train/val |
+ | Holdout result much worse than validation | Report overfitting, do not deploy | Sharpe 1.5 val, 0.3 holdout |
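The first three rows of the table can be checked mechanically. The following is a minimal sketch, assuming a per-run list of validation metrics where higher is better. The names `has_converged` and `stop_reason`, the 50-run window, and the tolerance are illustrative choices, not part of the package.

```python
from typing import Optional, Sequence

import numpy as np


def has_converged(metric_history: Sequence[float],
                  window: int = 50, tol: float = 0.01) -> bool:
    """True if the mean over the last `window` runs matches the previous window's mean."""
    history = np.asarray(metric_history, dtype=float)
    if history.size < 2 * window:
        return False  # Not enough runs to compare two full windows
    recent = history[-window:].mean()
    previous = history[-2 * window:-window].mean()
    return bool(abs(recent - previous) < tol)


def stop_reason(metric_history: Sequence[float], target: float, budget: int,
                window: int = 50, tol: float = 0.01) -> Optional[str]:
    """Return the first matching stop signal, or None to keep iterating."""
    if len(metric_history) > 0 and max(metric_history) >= target:
        return "target_met"
    if has_converged(metric_history, window, tol):
        return "converged"
    if len(metric_history) >= budget:
        return "budget_exhausted"
    return None
```

Whatever the reason returned, the actions in the table still apply: a met target or convergence leads to a single holdout evaluation, not to further tuning.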
+
+ ### Overfitting Red Flags
+
+ Watch for these warning signs during iteration:
+
+ 1. **Validation metric much better than cross-validation mean**: The specific validation split may be unusually easy. Use CV to get a robust estimate.
+ 2. **Improvement from many small parameters**: Complex models with many tuned parameters are more likely to overfit than simple models.
+ 3. **Results sensitive to data ordering**: If reshuffling which samples fall in the validation set changes the result significantly, the validation sample is too small.
+ 4. **Monotonically improving metrics across iterations**: Real research has noise. If every iteration is better than the last, something is leaking.
+ 5. **Results do not replicate across time periods**: A strategy that works on 2020-2022 but fails on 2023 is likely overfit to the training period.
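Red flags 1 and 4 lend themselves to simple automated checks. The sketch below is one possible way to encode them. The function names, the 10-run minimum, and the two-standard-deviation threshold are assumptions chosen for illustration, not values taken from the package.

```python
from typing import Sequence

import numpy as np


def is_suspiciously_monotonic(metric_history: Sequence[float],
                              min_runs: int = 10) -> bool:
    """Red flag 4: every run strictly better than the one before it."""
    history = np.asarray(metric_history, dtype=float)
    if history.size < min_runs:
        return False  # Short streaks of improvement are not unusual
    return bool(np.all(np.diff(history) > 0))


def validation_exceeds_cv(validation_metric: float, cv_mean: float,
                          cv_std: float, n_std: float = 2.0) -> bool:
    """Red flag 1: validation score far above the cross-validation mean."""
    return bool(validation_metric > cv_mean + n_std * cv_std)
```

The `cv_mean` and `cv_std` arguments correspond to the `"mean"` and `"std"` entries that a cross-validation routine reports per metric. A positive result from either check is a prompt to investigate, not proof of leakage.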