@zigrivers/scaffold 3.14.0 → 3.16.0
- package/README.md +50 -21
- package/content/knowledge/core/automated-review-tooling.md +21 -26
- package/content/knowledge/core/multi-model-review-dispatch.md +30 -55
- package/content/knowledge/research/research-architecture.md +385 -0
- package/content/knowledge/research/research-conventions.md +248 -0
- package/content/knowledge/research/research-dev-environment.md +303 -0
- package/content/knowledge/research/research-experiment-loop.md +429 -0
- package/content/knowledge/research/research-experiment-tracking.md +336 -0
- package/content/knowledge/research/research-ml-architecture-search.md +383 -0
- package/content/knowledge/research/research-ml-evaluation.md +407 -0
- package/content/knowledge/research/research-ml-experiment-tracking.md +466 -0
- package/content/knowledge/research/research-ml-training-patterns.md +413 -0
- package/content/knowledge/research/research-observability.md +395 -0
- package/content/knowledge/research/research-overfitting-prevention.md +306 -0
- package/content/knowledge/research/research-project-structure.md +264 -0
- package/content/knowledge/research/research-quant-backtesting.md +326 -0
- package/content/knowledge/research/research-quant-market-data.md +366 -0
- package/content/knowledge/research/research-quant-metrics.md +335 -0
- package/content/knowledge/research/research-quant-requirements.md +223 -0
- package/content/knowledge/research/research-quant-risk.md +469 -0
- package/content/knowledge/research/research-quant-strategy-patterns.md +412 -0
- package/content/knowledge/research/research-requirements.md +201 -0
- package/content/knowledge/research/research-security.md +374 -0
- package/content/knowledge/research/research-sim-compute-management.md +538 -0
- package/content/knowledge/research/research-sim-engine-patterns.md +448 -0
- package/content/knowledge/research/research-sim-parameter-spaces.md +425 -0
- package/content/knowledge/research/research-sim-validation.md +456 -0
- package/content/knowledge/research/research-testing.md +334 -0
- package/content/methodology/research-ml-research.yml +23 -0
- package/content/methodology/research-overlay.yml +65 -0
- package/content/methodology/research-quant-finance.yml +29 -0
- package/content/methodology/research-simulation.yml +23 -0
- package/content/tools/post-implementation-review.md +36 -7
- package/content/tools/review-code.md +33 -8
- package/content/tools/review-pr.md +79 -95
- package/dist/cli/commands/adopt.d.ts.map +1 -1
- package/dist/cli/commands/adopt.js +22 -1
- package/dist/cli/commands/adopt.js.map +1 -1
- package/dist/cli/commands/adopt.serialization.test.js +41 -0
- package/dist/cli/commands/adopt.serialization.test.js.map +1 -1
- package/dist/cli/commands/init.d.ts +4 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +32 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/init-flag-families.d.ts +6 -1
- package/dist/cli/init-flag-families.d.ts.map +1 -1
- package/dist/cli/init-flag-families.js +32 -1
- package/dist/cli/init-flag-families.js.map +1 -1
- package/dist/cli/init-flag-families.test.js +47 -0
- package/dist/cli/init-flag-families.test.js.map +1 -1
- package/dist/config/schema.d.ts +272 -16
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +25 -1
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +103 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.d.ts +12 -0
- package/dist/core/assembly/overlay-loader.d.ts.map +1 -1
- package/dist/core/assembly/overlay-loader.js +30 -0
- package/dist/core/assembly/overlay-loader.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +66 -1
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/core/assembly/overlay-state-resolver.d.ts.map +1 -1
- package/dist/core/assembly/overlay-state-resolver.js +48 -19
- package/dist/core/assembly/overlay-state-resolver.js.map +1 -1
- package/dist/core/assembly/overlay-state-resolver.test.js +80 -0
- package/dist/core/assembly/overlay-state-resolver.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.js +119 -0
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/project/adopt.d.ts.map +1 -1
- package/dist/project/adopt.js +3 -1
- package/dist/project/adopt.js.map +1 -1
- package/dist/project/detectors/disambiguate.js +1 -1
- package/dist/project/detectors/disambiguate.js.map +1 -1
- package/dist/project/detectors/index.d.ts.map +1 -1
- package/dist/project/detectors/index.js +2 -1
- package/dist/project/detectors/index.js.map +1 -1
- package/dist/project/detectors/ml.d.ts.map +1 -1
- package/dist/project/detectors/ml.js +2 -6
- package/dist/project/detectors/ml.js.map +1 -1
- package/dist/project/detectors/research.d.ts +4 -0
- package/dist/project/detectors/research.d.ts.map +1 -0
- package/dist/project/detectors/research.js +141 -0
- package/dist/project/detectors/research.js.map +1 -0
- package/dist/project/detectors/research.test.d.ts +2 -0
- package/dist/project/detectors/research.test.d.ts.map +1 -0
- package/dist/project/detectors/research.test.js +235 -0
- package/dist/project/detectors/research.test.js.map +1 -0
- package/dist/project/detectors/shared-signals.d.ts +3 -0
- package/dist/project/detectors/shared-signals.d.ts.map +1 -0
- package/dist/project/detectors/shared-signals.js +9 -0
- package/dist/project/detectors/shared-signals.js.map +1 -0
- package/dist/project/detectors/types.d.ts +6 -2
- package/dist/project/detectors/types.d.ts.map +1 -1
- package/dist/project/detectors/types.js.map +1 -1
- package/dist/types/config.d.ts +7 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/copy/core.d.ts.map +1 -1
- package/dist/wizard/copy/core.js +4 -0
- package/dist/wizard/copy/core.js.map +1 -1
- package/dist/wizard/copy/index.d.ts.map +1 -1
- package/dist/wizard/copy/index.js +2 -0
- package/dist/wizard/copy/index.js.map +1 -1
- package/dist/wizard/copy/research.d.ts +3 -0
- package/dist/wizard/copy/research.d.ts.map +1 -0
- package/dist/wizard/copy/research.js +27 -0
- package/dist/wizard/copy/research.js.map +1 -0
- package/dist/wizard/copy/types.d.ts +5 -1
- package/dist/wizard/copy/types.d.ts.map +1 -1
- package/dist/wizard/flags.d.ts +7 -1
- package/dist/wizard/flags.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +4 -2
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +27 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +51 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +3 -2
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +3 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
@@ -0,0 +1,334 @@
+---
+name: research-testing
+description: Testing experiment loops including determinism tests, result validation, integration tests for experiment pipelines, and regression baselines
+topics: [research, testing, determinism, validation, integration-tests, regression, tdd]
+---
+
+Research code is notoriously undertested because "the results are stochastic" feels like a valid excuse. It is not. The experiment runner, evaluation framework, data pipeline, and state management are all deterministic and must be tested rigorously. The stochastic parts (experiment outcomes) require seed-based determinism tests and statistical validation. Untested experiment loops produce unreliable results that waste compute and mislead researchers.
+
+## Summary
+
+Test research projects at four levels: determinism tests (same seed produces same results), component tests (runner, evaluator, tracker work correctly in isolation), integration tests (full experiment loop produces valid results on fixture data), and regression tests (new code changes do not alter previously established baselines). Use pytest with fixtures for small datasets and mocked external dependencies. Run tests on every commit -- fast tests in pre-commit, slow integration tests in CI.
+
+## Deep Guidance
+
+### Determinism Tests
+
+The most important property of a research system: given the same seed and config, it must produce identical results:
+
+```python
+# tests/test_determinism.py
+import copy
+from src.runner.experiment_runner import ExperimentRunner
+from src.seed import set_seed
+
+class TestDeterminism:
+    def test_same_seed_same_results(self, tmp_path, fixture_config):
+        """Two runs with the same seed must produce identical metrics."""
+        config = copy.deepcopy(fixture_config)  # deep copy: nested dicts are mutated below
+        config["experiment"]["seed"] = 42
+        config["logging"]["results_dir"] = str(tmp_path)
+
+        # Run 1
+        set_seed(42)
+        runner1 = ExperimentRunner(config)
+        result1 = runner1.run_single()
+
+        # Run 2
+        set_seed(42)
+        runner2 = ExperimentRunner(config)
+        result2 = runner2.run_single()
+
+        assert result1.metrics == result2.metrics, (
+            f"Non-deterministic results:\n"
+            f"  Run 1: {result1.metrics}\n"
+            f"  Run 2: {result2.metrics}"
+        )
+
+    def test_different_seeds_different_results(self, tmp_path, fixture_config):
+        """Different seeds should produce different results (not trivially constant)."""
+        config = copy.deepcopy(fixture_config)
+        config["logging"]["results_dir"] = str(tmp_path)
+
+        set_seed(42)
+        runner1 = ExperimentRunner(config)
+        result1 = runner1.run_single()
+
+        set_seed(123)
+        runner2 = ExperimentRunner(config)
+        result2 = runner2.run_single()
+
+        assert result1.metrics != result2.metrics, (
+            "Different seeds produced identical results -- "
+            "strategy may be ignoring the seed"
+        )
+
+    def test_seed_isolation_between_runs(self, tmp_path, fixture_config):
+        """Each run in the loop must use an independent seed."""
+        config = copy.deepcopy(fixture_config)
+        config["experiment"]["seed"] = 42
+        config["experiment"]["num_runs"] = 5
+        config["logging"]["results_dir"] = str(tmp_path)
+
+        runner = ExperimentRunner(config)
+        state = runner.run_loop()
+
+        # Verify all runs produced different metrics (not re-using the same seed)
+        metric_values = [r["metrics"]["primary"] for r in state.history]
+        assert len(set(str(v) for v in metric_values)) > 1, (
+            "All runs produced identical metrics -- seed may not be incremented"
+        )
+```
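The determinism tests above import `set_seed` from `src.seed` but the diff does not show it. The following is a minimal sketch of what such a helper might look like, assuming the project only needs Python's `random` module and, when installed, NumPy. `derive_run_seed` is a hypothetical helper (not part of the package) that illustrates one way to give each loop iteration its own reproducible seed, which is the property `test_seed_isolation_between_runs` checks.

```python
import hashlib
import random


def set_seed(seed: int) -> None:
    """Seed the global RNGs this sketch knows about (assumption: random + NumPy)."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency; skipped when not installed
        np.random.seed(seed)
    except ImportError:
        pass


def derive_run_seed(base_seed: int, run_index: int) -> int:
    """Hypothetical helper: map (base seed, run index) to a stable 32-bit seed."""
    digest = hashlib.sha256(f"{base_seed}:{run_index}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


# Same seed, same draws.
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
```

Hashing the pair instead of using `base_seed + run_index` avoids overlapping seed sequences when two experiments use adjacent base seeds. Frameworks with their own RNG state would need to be seeded separately.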
+
+### Component Tests
+
+Test each component of the experiment system in isolation:
+
+```python
+# tests/test_evaluator.py
+import pytest
+from src.evaluation.evaluator import MetricEvaluator
+
+class TestMetricEvaluator:
+    @pytest.fixture
+    def evaluator(self):
+        return MetricEvaluator(
+            primary_metric="sharpe_ratio",
+            direction="maximize",
+        )
+
+    def test_evaluate_returns_expected_metrics(self, evaluator):
+        """Evaluator must return all configured metrics."""
+        raw_results = {
+            "returns": [0.01, -0.005, 0.02, -0.01, 0.015],
+            "trades": 5,
+        }
+        metrics = evaluator.evaluate(raw_results)
+        assert "sharpe_ratio" in metrics
+        assert "max_drawdown" in metrics
+        assert "num_trades" in metrics
+        assert isinstance(metrics["sharpe_ratio"], float)
+
+    def test_is_improvement_maximization(self, evaluator):
+        """Higher primary metric should be an improvement when maximizing."""
+        current = {"sharpe_ratio": 1.5, "max_drawdown": 0.1}
+        best = {"sharpe_ratio": 1.2, "max_drawdown": 0.12}
+        assert evaluator.is_improvement(current, best) is True
+
+    def test_is_not_improvement(self, evaluator):
+        """Lower primary metric should not be an improvement when maximizing."""
+        current = {"sharpe_ratio": 1.0, "max_drawdown": 0.1}
+        best = {"sharpe_ratio": 1.5, "max_drawdown": 0.12}
+        assert evaluator.is_improvement(current, best) is False
+
+    def test_evaluate_empty_results_raises(self, evaluator):
+        """Empty results must raise a clear error, not return NaN."""
+        with pytest.raises(ValueError, match="empty"):
+            evaluator.evaluate({"returns": [], "trades": 0})
+
+
+# tests/test_state.py
+import pytest
+import json
+from pathlib import Path
+from src.runner.state import ExperimentState, RunRecord
+
+class TestExperimentState:
+    def test_save_and_load_roundtrip(self, tmp_path):
+        """State must survive a save/load cycle."""
+        state = ExperimentState(experiment_id="test-001")
+        run = RunRecord(
+            run_id="run-0001",
+            config={"strategy": {"type": "momentum"}},
+            metrics={"sharpe_ratio": 1.5},
+            is_best=True,
+            decision="keep",
+        )
+        state.record_run(run)
+
+        path = tmp_path / "state.json"
+        state.save(path)
+        loaded = ExperimentState.load(path)
+
+        assert loaded.experiment_id == "test-001"
+        assert loaded.total_runs == 1
+        assert loaded.best_run.metrics == {"sharpe_ratio": 1.5}
+
+    def test_runs_since_improvement_tracking(self):
+        """State must track runs since last improvement."""
+        state = ExperimentState(experiment_id="test")
+
+        # First run is always best
+        state.record_run(RunRecord(
+            run_id="1", config={}, metrics={"m": 1.0}, is_best=True, decision="keep",
+        ))
+        assert state.runs_since_improvement == 0
+
+        # Non-improvement increments counter
+        state.record_run(RunRecord(
+            run_id="2", config={}, metrics={"m": 0.5}, is_best=False, decision="discard",
+        ))
+        assert state.runs_since_improvement == 1
+
+        # New best resets counter
+        state.record_run(RunRecord(
+            run_id="3", config={}, metrics={"m": 2.0}, is_best=True, decision="keep",
+        ))
+        assert state.runs_since_improvement == 0
+```
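The evaluator tests above pin down an interface without showing an implementation. As an illustration only, here is a small evaluator that would satisfy those four tests. The class body is an assumption: the package's real `src.evaluation.evaluator.MetricEvaluator` is not shown in this diff and may compute these metrics differently (for example, with annualization or a risk-free rate).

```python
from statistics import mean, stdev


class MetricEvaluator:
    """Illustrative stand-in; not the package's actual implementation."""

    def __init__(self, primary_metric: str, direction: str = "maximize"):
        if direction not in ("maximize", "minimize"):
            raise ValueError(f"unknown direction: {direction!r}")
        self.primary_metric = primary_metric
        self.direction = direction

    def evaluate(self, raw_results: dict) -> dict:
        returns = raw_results.get("returns", [])
        if not returns:
            # Fail loudly instead of letting a division produce NaN.
            raise ValueError("returns are empty; cannot compute metrics")
        # Per-period Sharpe ratio, zero risk-free rate, not annualized.
        spread = stdev(returns) if len(returns) > 1 else 0.0
        sharpe = mean(returns) / spread if spread > 0 else 0.0
        # Max drawdown: largest peak-to-trough drop of the compounded equity curve.
        equity, peak, max_dd = 1.0, 1.0, 0.0
        for r in returns:
            equity *= 1.0 + r
            peak = max(peak, equity)
            max_dd = max(max_dd, (peak - equity) / peak)
        return {
            "sharpe_ratio": float(sharpe),
            "max_drawdown": float(max_dd),
            "num_trades": int(raw_results.get("trades", 0)),
        }

    def is_improvement(self, current: dict, best: dict) -> bool:
        cur, ref = current[self.primary_metric], best[self.primary_metric]
        return cur > ref if self.direction == "maximize" else cur < ref
```

Raising on empty input keeps NaN out of the experiment history, where it would otherwise compare as "not an improvement" forever and hide the underlying data problem.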
+
+### Integration Tests
+
+Integration tests run the full experiment loop on small fixture data:
+
+```python
+# tests/test_integration.py
+import pytest
+from pathlib import Path
+from src.runner.experiment_runner import ExperimentRunner
+from src.loop.state_machine import ExperimentLoop, LoopState
+
+class TestExperimentLoopIntegration:
+    @pytest.fixture
+    def small_config(self, tmp_path):
+        return {
+            "experiment": {"seed": 42, "num_runs": 10},
+            "strategy": {"type": "mock_strategy", "params": {}},
+            "budget": {"max_runs": 10, "patience": 5},
+            "logging": {"results_dir": str(tmp_path / "results")},
+        }
+
+    def test_loop_runs_to_completion(self, small_config, tmp_path):
+        """Loop must complete within budget and produce valid state."""
+        runner = ExperimentRunner(small_config)
+        state = runner.run_loop()
+
+        assert state.total_runs <= 10
+        assert state.best_run is not None
+        assert len(state.history) == state.total_runs
+
+    def test_loop_persists_state(self, small_config, tmp_path):
+        """State file must exist and be loadable after loop completes."""
+        runner = ExperimentRunner(small_config)
+        runner.run_loop()
+
+        state_path = Path(small_config["logging"]["results_dir"]) / "state.json"
+        assert state_path.exists()
+
+        loaded = LoopState.load(state_path)
+        assert loaded.iteration > 0
+
+    def test_loop_resume_after_interruption(self, small_config, tmp_path):
+        """Loop must resume correctly from persisted state."""
+        config = dict(small_config)
+        config["budget"] = {**small_config["budget"], "max_runs": 20}  # new dict: leaves the fixture untouched
+
+        # Run 10 iterations
+        runner1 = ExperimentRunner(config)
+        runner1.budget.max_runs = 10
+        state1 = runner1.run_loop()
+        assert state1.total_runs == 10
+
+        # Resume from saved state, run 10 more
+        runner2 = ExperimentRunner(config)
+        state2 = runner2.run_loop()
+        assert state2.total_runs == 20
+
+    def test_results_directory_structure(self, small_config, tmp_path):
+        """Each run must create the expected result files."""
+        runner = ExperimentRunner(small_config)
+        runner.run_loop()
+
+        results_dir = Path(small_config["logging"]["results_dir"])
+        run_dirs = sorted(d for d in results_dir.iterdir()
+                          if d.is_dir() and d.name.startswith("run-"))
+
+        assert len(run_dirs) > 0
+        for run_dir in run_dirs:
+            assert (run_dir / "config.json").exists()
+            assert (run_dir / "metrics.json").exists()
+```
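The resume test above relies on the runner picking up `state.json` from the results directory. A rough sketch of that persist-and-resume behavior, using only the standard library, might look like the following. The function name `run_resumable` and the state layout are assumptions made for illustration; the package's `ExperimentRunner` and `LoopState` are not shown in this diff.

```python
import json
import tempfile
from pathlib import Path


def run_resumable(results_dir: Path, max_runs: int) -> dict:
    """Run until `max_runs` total iterations, resuming from state.json if present."""
    results_dir.mkdir(parents=True, exist_ok=True)
    state_path = results_dir / "state.json"
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"total_runs": 0, "history": []}
    while state["total_runs"] < max_runs:
        run_index = state["total_runs"]
        # Placeholder for the real experiment work.
        state["history"].append({"run_id": f"run-{run_index:04d}"})
        state["total_runs"] += 1
        # Write to a temp file and rename, so a crash never leaves a half-written state.
        tmp = state_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(state_path)
    return state


with tempfile.TemporaryDirectory() as tmp_dir:
    results = Path(tmp_dir) / "results"
    first = run_resumable(results, max_runs=10)   # fresh start
    second = run_resumable(results, max_runs=20)  # continues from the persisted 10
```

Persisting after every iteration costs one small write per run but bounds the work lost in a crash to a single iteration.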
+
+### Regression Baselines
+
+Establish metric baselines so that code changes do not silently degrade results:
+
+```python
+# tests/test_regression.py
+import pytest
+import json
+from pathlib import Path
+
+BASELINE_PATH = Path("tests/fixtures/baselines/metrics_baseline.json")
+
+class TestRegressionBaseline:
+    @pytest.fixture  # function scope: a class-scoped fixture cannot request tmp_path
+    def current_metrics(self, fixture_config, tmp_path):
+        """Run the standard benchmark and return metrics."""
+        from src.runner.experiment_runner import ExperimentRunner
+        runner = ExperimentRunner(fixture_config)  # fixture_config comes from tests/conftest.py
+        state = runner.run_loop()
+        return state.best_run.metrics
+
+    @pytest.fixture(scope="class")
+    def baseline_metrics(self):
+        """Load the committed baseline metrics."""
+        with open(BASELINE_PATH) as f:
+            return json.load(f)
+
+    def test_primary_metric_no_regression(self, current_metrics, baseline_metrics):
+        """Primary metric must not regress beyond tolerance."""
+        tolerance = 0.05  # 5% relative tolerance
+        baseline = baseline_metrics["sharpe_ratio"]
+        current = current_metrics["sharpe_ratio"]
+        assert current >= baseline * (1 - tolerance), (
+            f"Regression: sharpe_ratio {current:.4f} < baseline {baseline:.4f} "
+            f"(tolerance: {tolerance:.0%})"
+        )
+```
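One caveat about the tolerance check above: `baseline * (1 - tolerance)` only lowers the bar when the baseline is positive. With a negative baseline (a Sharpe ratio of -1.0, say) it raises the bar to -0.95, so even an unchanged result fails. The sketch below shows a sign-aware variant; the helper names are hypothetical and not part of the package.

```python
def regression_floor(baseline: float, tolerance: float) -> float:
    """Lowest acceptable value for a higher-is-better metric.

    Uses abs(baseline) so the allowance always points downward,
    including when the baseline itself is negative.
    """
    return baseline - abs(baseline) * tolerance


def check_no_regression(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True when `current` is within the allowed drop below `baseline`."""
    return current >= regression_floor(baseline, tolerance)


# Naive formula with a negative baseline: the unchanged value is rejected.
naive_accepts_unchanged = -1.0 >= -1.0 * (1 - 0.05)
# Sign-aware formula: the unchanged value passes.
aware_accepts_unchanged = check_no_regression(-1.0, -1.0)
```

A baseline of exactly zero leaves no allowance under either formula; an absolute tolerance is the usual fallback in that case.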
+
+### Test Fixtures
+
+```python
+# tests/conftest.py
+import pytest
+
+@pytest.fixture
+def fixture_config(tmp_path):
+    """Minimal config for fast tests."""
+    return {
+        "experiment": {"seed": 42, "num_runs": 5},
+        "strategy": {"type": "mock_strategy", "params": {}},
+        "data": {"source": "tests/fixtures/small_data.csv"},
+        "budget": {"max_runs": 5, "patience": 3},
+        "logging": {"results_dir": str(tmp_path / "results")},
+    }
+
+@pytest.fixture
+def mock_strategy():
+    """Strategy that returns predictable results for testing."""
+    class MockStrategy:
+        name = "mock_strategy"
+        _call_count = 0
+
+        def execute(self, config):
+            self._call_count += 1
+            return {
+                "returns": [0.01 * self._call_count, -0.005, 0.02],
+                "trades": self._call_count * 10,
+            }
+
+        def next_hypothesis(self, state):
+            return {"param": state.iteration}
+
+    return MockStrategy()
+```
+
+### Testing Best Practices for Research
+
+- **Fast tests in pre-commit**: Determinism and component tests must run in < 10 seconds.
+- **Slow tests in CI**: Integration tests with actual experiment execution run in CI only (mark with `@pytest.mark.slow`).
+- **Mock external resources**: Mock file I/O, API calls, and database connections in unit tests. Integration tests may use real file I/O with `tmp_path`.
+- **Test the loop termination**: Verify that every stopping condition actually stops the loop. Budget exhaustion, patience, convergence, and error limits must all be tested.
+- **Test crash recovery**: Simulate a crash by persisting state mid-loop, then verify the loop resumes correctly.
+- **Baseline updates are deliberate**: Updating regression baselines requires a commit message explaining why the baseline changed. Never auto-update baselines in CI.
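The "test the loop termination" practice above can be illustrated with a stripped-down loop that reports why it stopped, so each stopping condition gets its own assertion. This is a sketch under stated assumptions: `run_until_stopped` and its reason strings are hypothetical, and it covers only the budget and patience conditions, not convergence or error limits.

```python
from typing import Iterable, Tuple


def run_until_stopped(scores: Iterable[float], max_runs: int, patience: int) -> Tuple[int, str]:
    """Consume scores (higher is better) until a stopping condition fires.

    Returns (runs_executed, reason); reason is "budget", "patience",
    or "exhausted" when the score source runs dry first.
    """
    if max_runs <= 0:
        return 0, "budget"
    best = float("-inf")
    runs = 0
    since_improvement = 0
    for score in scores:
        runs += 1
        if score > best:
            best, since_improvement = score, 0
        else:
            since_improvement += 1
        if since_improvement >= patience:
            return runs, "patience"
        if runs >= max_runs:
            return runs, "budget"
    return runs, "exhausted"


# Always improving: only the budget can stop the loop.
budget_case = run_until_stopped(range(100), max_runs=10, patience=5)
# One good run followed by three worse ones: patience fires on run 4.
patience_case = run_until_stopped([1.0, 0.5, 0.4, 0.3], max_runs=10, patience=3)
```

Returning the reason alongside the count lets a test distinguish "stopped at the right time" from "stopped at the right time for the wrong reason".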
@@ -0,0 +1,23 @@
+# methodology/research-ml-research.yml
+name: research-ml-research
+description: >
+  ML-research domain sub-overlay — adds architecture search, training
+  patterns, and evaluation knowledge for ML research projects.
+project-type: research
+domain: ml-research
+
+knowledge-overrides:
+  system-architecture:
+    append: [research-ml-architecture-search, research-ml-training-patterns]
+  operations:
+    append: [research-ml-experiment-tracking]
+  tdd:
+    append: [research-ml-evaluation]
+  create-evals:
+    append: [research-ml-evaluation]
+  review-architecture:
+    append: [research-ml-architecture-search]
+  review-testing:
+    append: [research-ml-evaluation]
+  implementation-plan:
+    append: [research-ml-architecture-search]
@@ -0,0 +1,65 @@
+# methodology/research-overlay.yml
+name: research
+description: >
+  Research overlay — injects research domain knowledge into existing
+  pipeline steps for experiment loop architecture, tracking, evaluation,
+  overfitting prevention, and domain-specific patterns.
+project-type: research
+
+# ---------------------------------------------------------------------------
+# knowledge-overrides
+# ---------------------------------------------------------------------------
+# Map research knowledge entries into existing pipeline steps so that
+# experiment loop domain expertise is injected during prompt assembly.
+knowledge-overrides:
+  # Foundational (6 steps)
+  create-prd:
+    append: [research-requirements]
+  user-stories:
+    append: [research-requirements]
+  coding-standards:
+    append: [research-conventions]
+  project-structure:
+    append: [research-project-structure]
+  dev-env-setup:
+    append: [research-dev-environment]
+  git-workflow:
+    append: [research-conventions]
+
+  # Architecture & Design (6 steps)
+  system-architecture:
+    append: [research-architecture, research-experiment-loop]
+  tech-stack:
+    append: [research-architecture]
+  adrs:
+    append: [research-architecture]
+  domain-modeling:
+    append: [research-experiment-loop]
+  security:
+    append: [research-security]
+  operations:
+    append: [research-experiment-tracking, research-observability]
+
+  # Testing (4 steps)
+  tdd:
+    append: [research-testing, research-overfitting-prevention]
+  add-e2e-testing:
+    append: [research-testing]
+  create-evals:
+    append: [research-testing, research-overfitting-prevention]
+  story-tests:
+    append: [research-testing]
+
+  # Reviews (4 steps)
+  review-architecture:
+    append: [research-architecture, research-experiment-loop]
+  review-security:
+    append: [research-security]
+  review-operations:
+    append: [research-experiment-tracking, research-observability]
+  review-testing:
+    append: [research-testing, research-overfitting-prevention]
+
+  # Planning (1 step)
+  implementation-plan:
+    append: [research-architecture]
@@ -0,0 +1,29 @@
+# methodology/research-quant-finance.yml
+name: research-quant-finance
+description: >
+  Quant-finance domain sub-overlay — adds trading strategy, backtesting,
+  risk analysis, and market data knowledge to research projects.
+project-type: research
+domain: quant-finance
+
+knowledge-overrides:
+  create-prd:
+    append: [research-quant-requirements]
+  system-architecture:
+    append: [research-quant-backtesting, research-quant-strategy-patterns]
+  domain-modeling:
+    append: [research-quant-market-data]
+  security:
+    append: [research-quant-risk]
+  operations:
+    append: [research-quant-metrics]
+  tdd:
+    append: [research-quant-backtesting]
+  create-evals:
+    append: [research-quant-metrics, research-quant-backtesting]
+  review-architecture:
+    append: [research-quant-backtesting, research-quant-strategy-patterns]
+  review-testing:
+    append: [research-quant-backtesting]
+  implementation-plan:
+    append: [research-quant-backtesting, research-quant-strategy-patterns]
@@ -0,0 +1,23 @@
+# methodology/research-simulation.yml
+name: research-simulation
+description: >
+  Simulation domain sub-overlay — adds physics/materials simulation engine,
+  parameter space, and compute management knowledge.
+project-type: research
+domain: simulation
+
+knowledge-overrides:
+  system-architecture:
+    append: [research-sim-engine-patterns, research-sim-parameter-spaces]
+  domain-modeling:
+    append: [research-sim-parameter-spaces]
+  operations:
+    append: [research-sim-compute-management]
+  tdd:
+    append: [research-sim-validation]
+  create-evals:
+    append: [research-sim-validation, research-sim-parameter-spaces]
+  review-architecture:
+    append: [research-sim-engine-patterns]
+  implementation-plan:
+    append: [research-sim-engine-patterns]
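The four overlay files above all express the same idea: per pipeline step, a list of knowledge entries to `append`. How the package's overlay loader combines the base `research` overlay with a domain sub-overlay is not visible in these hunks, so the following is only a guess at the simplest plausible behavior: concatenate the `append` lists in overlay order and drop duplicates. The function name and the plain-dict representation of the parsed YAML are assumptions.

```python
from typing import Dict, List


def merge_knowledge_overrides(*overlays: dict) -> Dict[str, List[str]]:
    """Combine `append` lists per pipeline step, in overlay order, without duplicates."""
    merged: Dict[str, List[str]] = {}
    for overlay in overlays:
        for step, override in overlay.get("knowledge-overrides", {}).items():
            entries = merged.setdefault(step, [])
            for name in override.get("append", []):
                if name not in entries:
                    entries.append(name)
    return merged


# Excerpts of research-overlay.yml and research-simulation.yml as parsed dicts.
base = {"knowledge-overrides": {
    "tdd": {"append": ["research-testing", "research-overfitting-prevention"]},
    "operations": {"append": ["research-experiment-tracking", "research-observability"]},
}}
simulation = {"knowledge-overrides": {
    "tdd": {"append": ["research-sim-validation"]},
    "operations": {"append": ["research-sim-compute-management"]},
}}

merged = merge_knowledge_overrides(base, simulation)
```

Under this assumed merge, the `tdd` step for a simulation project would receive the two general research entries followed by `research-sim-validation`.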
|
|
@@ -26,7 +26,7 @@ comprehensive quality check before releasing or handing off the project.
|
|
|
26
26
|
The three channels are:
|
|
27
27
|
1. **Codex CLI** — Implementation correctness, security, API contracts
|
|
28
28
|
2. **Gemini CLI** — Design reasoning, architectural patterns, broad context
|
|
29
|
-
3. **
|
|
29
|
+
3. **Claude CLI** — Plan alignment, code quality, testing
|
|
30
30
|
|
|
31
31
|
## Inputs
|
|
32
32
|
|
|
@@ -191,6 +191,7 @@ Return ALL findings as valid JSON:
|
|
|
191
191
|
{
|
|
192
192
|
"severity": "P0|P1|P2|P3",
|
|
193
193
|
"category": "architecture-alignment|security|error-handling|test-coverage|complexity|dependencies",
|
|
194
|
+
"location": "relative/path/to/file.ts:42",
|
|
194
195
|
"file": "relative/path/to/file.ts",
|
|
195
196
|
"line": 42,
|
|
196
197
|
"description": "Specific description of the issue",
|
|
@@ -226,7 +227,7 @@ If not installed: queue a compensating pass (implementation correctness, securit
|
|
|
226
227
|
codex login status 2>/dev/null && echo "codex authenticated" || echo "codex NOT authenticated"
|
|
227
228
|
```
|
|
228
229
|
|
|
229
|
-
If not authenticated: tell the user "Codex auth expired. Run: `! codex login`". Do NOT silently skip. Wait for re-auth and retry once. If auth cannot be recovered (
|
|
230
|
+
If not authenticated: tell the user "Codex auth expired. Run: `! codex login`". Do NOT silently skip. Wait for re-auth and retry once. If auth cannot be recovered (timeout or user declines): queue a compensating pass (implementation correctness, security, API contracts, labeled `[compensating: Codex-equivalent]`).
|
|
230
231
|
|
|
231
232
|
If Codex fails during execution (non-zero exit, malformed output, timeout): queue a compensating pass with the same focus and label.
|
|
232
233
|
|
|
@@ -254,7 +255,7 @@ If not installed: queue a compensating pass (architectural patterns, design reas
|
|
|
254
255
|
NO_BROWSER=true gemini -p "respond with ok" -o json 2>&1
|
|
255
256
|
```
|
|
256
257
|
|
|
257
|
-
If exit code is 41: tell the user "Gemini auth expired. Run: `! gemini -p \"hello\"`". Do NOT silently skip. Wait for re-auth and retry once. If auth cannot be recovered (
|
|
258
|
+
If exit code is 41: tell the user "Gemini auth expired. Run: `! gemini -p \"hello\"`". Do NOT silently skip. Wait for re-auth and retry once. If auth cannot be recovered (timeout or user declines): queue a compensating pass (architectural patterns, design reasoning, broad context, labeled `[compensating: Gemini-equivalent]`).
|
|
258
259
|
|
|
259
260
|
If Gemini fails during execution (non-zero exit, malformed output, timeout): queue a compensating pass with the same focus and label.
|
|
260
261
|
|
|
@@ -291,6 +292,7 @@ surfaces to this format before returning):
|
|
|
291
292
|
{
|
|
292
293
|
"severity": "P0|P1|P2|P3",
|
|
293
294
|
"category": "architecture-alignment|security|error-handling|test-coverage|complexity|dependencies",
|
|
295
|
+
"location": "relative/path/to/file.ts:42",
|
|
294
296
|
"file": "relative/path/to/file.ts",
|
|
295
297
|
"line": 42,
|
|
296
298
|
"description": "Specific description of the issue",
|
|
@@ -300,6 +302,10 @@ surfaces to this format before returning):
|
|
|
300
302
|
}
|
|
301
303
|
```
|
|
302
304
|
|
|
305
|
+
**MMR compatibility:** The `location` field (`file:line` format) is required for
|
|
306
|
+
`mmr reconcile` injection. The `file` and `line` fields are retained for backward
|
|
307
|
+
compatibility with direct channel consumers.
|
|
308
|
+
|
|
303
309
|
Store as `SUPERPOWERS_PHASE1_FINDINGS`.
|
|
304
310
|
|
|
305
311
|
### Step 5: Run Phase 2 — Parallel User Story Review
|
|
@@ -441,8 +447,8 @@ before returning. Then return all three channels' findings plus channel status:
 {
 "story": "[STORY_TITLE]",
 "channel_status": {
-"codex": { "root_cause": "null|not_installed|auth_failed|
-"gemini": { "root_cause": "null|not_installed|auth_failed|
+"codex": { "root_cause": "null|not_installed|auth_failed|timeout|failed", "coverage_status": "full|compensating" },
+"gemini": { "root_cause": "null|not_installed|auth_failed|timeout|failed", "coverage_status": "full|compensating" },
 "superpowers": { "root_cause": null, "coverage_status": "full" }
 },
 "codex": { "findings": [...] },
@@ -453,6 +459,29 @@ before returning. Then return all three channels' findings plus channel status:

 Collect findings from all subagents. Store as `PHASE2_FINDINGS`.

+### Step 5e: Optional — Inject Findings into MMR for Unified Reconciliation
+
+If an MMR job exists (e.g., from a prior `mmr review` run on the same branch), the
+agent can inject its post-implementation review findings into MMR for unified
+reconciliation across all channels:
+
+```bash
+# Inject Phase 1 and Phase 2 findings into an existing MMR job
+# Write agent findings to a temp file for mmr reconcile
+echo "$AGENT_FINDINGS" > /tmp/agent-findings.json
+mmr reconcile "$JOB_ID" --channel superpowers --input /tmp/agent-findings.json
+```
+
+All findings injected via `mmr reconcile` must use MMR-compatible schema: each
+finding needs `severity` (P0-P3), `location` (file:line), and `description`
+(`suggestion` is optional). The strict validator will reject findings with
+missing or invalid required fields.
+
+This step is optional — post-implementation review is a full-codebase review (not
+diff-only), so it operates independently of `mmr review`. Use `mmr reconcile` only
+when you want to merge post-implementation findings into an existing MMR job for a
+single unified verdict.
+
 ### Step 6: Consolidate Findings

 Merge all findings from Phase 1 (`CODEX_PHASE1_FINDINGS`, `GEMINI_PHASE1_FINDINGS`,
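Step 5e in the hunk above says the strict validator rejects findings with missing or invalid required fields. The diff does not show that validator, so the following is only a guess at what a pre-flight check could look like before calling `mmr reconcile`. The function name `validate_finding`, the regular expression, and the error messages are all assumptions.

```python
import re

SEVERITIES = {"P0", "P1", "P2", "P3"}
# Assumed shape of `location`: some path, a colon, then a line number.
LOCATION_RE = re.compile(r"^.+:\d+$")


def validate_finding(finding: dict) -> list[str]:
    """Return a list of problems; an empty list means the finding looks injectable."""
    errors = []
    if finding.get("severity") not in SEVERITIES:
        errors.append("severity must be one of P0-P3")
    location = finding.get("location")
    if not isinstance(location, str) or not LOCATION_RE.match(location):
        errors.append("location must look like file:line")
    description = finding.get("description")
    if not isinstance(description, str) or not description.strip():
        errors.append("description must be a non-empty string")
    # `suggestion` is optional, so it is only checked when present.
    suggestion = finding.get("suggestion")
    if suggestion is not None and not isinstance(suggestion, str):
        errors.append("suggestion, when present, must be a string")
    return errors


ok = {"severity": "P1", "location": "src/a.ts:42", "description": "Unchecked input"}
bad = {"severity": "P9", "location": "src/a.ts", "description": ""}
print(validate_finding(ok), len(validate_finding(bad)))
```

Running a check like this locally surfaces schema problems before the findings are written to the temp file, although the real MMR validator may apply different or stricter rules.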
@@ -656,9 +685,9 @@ the user they require manual attention before the project is ready to release.
 | Codex not installed (`command -v` fails) | Queue compensating pass (implementation correctness, security, API contracts, labeled `[compensating: Codex-equivalent]`); document as "not_installed" in report |
 | Gemini not installed (`command -v` fails) | Queue compensating pass (architectural patterns, design reasoning, broad context, labeled `[compensating: Gemini-equivalent]`); document as "not_installed" in report |
 | Codex auth expired — user recovers | Re-run auth check; proceed with full Codex channel |
-| Codex auth expired — user declines or
+| Codex auth expired — user declines or timeout | Queue compensating pass (implementation correctness, security, API contracts, labeled `[compensating: Codex-equivalent]`); document as "auth_failed" or "timeout" in report |
 | Gemini auth expired (exit 41) — user recovers | Re-run auth check; proceed with full Gemini channel |
-| Gemini auth expired — user declines or
+| Gemini auth expired — user declines or timeout | Queue compensating pass (architectural patterns, design reasoning, broad context, labeled `[compensating: Gemini-equivalent]`); document as "auth_failed" or "timeout" in report |
 | Channel fails during execution (non-zero exit, malformed output, timeout) | Queue compensating pass for that channel with same focus and label; document root cause in report |
 | Both external CLIs unavailable (any combination of not_installed / auth failure) | Run all compensating passes plus Superpowers code-reviewer; report coverage as "degraded-coverage"; warn user that review coverage is reduced |
 | Superpowers unavailable | Document as "unavailable" in report; proceed with remaining channels; Superpowers is a Claude subagent and should always be available |
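The failure-handling rows in the hunk above all follow one pattern: a channel that did not complete gets a compensating pass with a fixed focus and label, and its root cause is recorded. A minimal sketch of that mapping follows. The focus strings and labels are copied from the table, while the function name `channel_status` and the extra `focus` and `label` keys are hypothetical and do not appear in the `channel_status` JSON shown earlier in the diff.

```python
# Focus and label per external channel, as listed in the table rows.
COMPENSATING = {
    "codex": (
        "implementation correctness, security, API contracts",
        "[compensating: Codex-equivalent]",
    ),
    "gemini": (
        "architectural patterns, design reasoning, broad context",
        "[compensating: Gemini-equivalent]",
    ),
}


def channel_status(channel: str, root_cause):
    """Map a channel outcome to a status record.

    `root_cause` is None when the channel completed, otherwise one of
    "not_installed", "auth_failed", "timeout", or "failed".
    """
    if root_cause is None:
        return {"root_cause": None, "coverage_status": "full"}
    focus, label = COMPENSATING[channel]
    return {
        "root_cause": root_cause,
        "coverage_status": "compensating",
        "focus": focus,  # hypothetical key, for illustration only
        "label": label,  # hypothetical key, for illustration only
    }


print(channel_status("gemini", "auth_failed")["label"])
```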
@@ -23,7 +23,7 @@ anything leaves the machine.
 The three channels are:
 1. **Codex CLI** — implementation correctness, security, API contracts
 2. **Gemini CLI** — architectural patterns, broad-context reasoning
-3. **
+3. **Claude CLI** — Claude subagent review of code quality, tests, and plan alignment

 ## Inputs

@@ -46,6 +46,31 @@ The three channels are:

 ## Instructions

+### Primary: MMR CLI + Agent Reconcile
+
+When the MMR CLI is installed, use it as the primary entry point:
+
+```bash
+# Staged changes
+mmr review --staged --sync --format json
+
+# Branch diff against main
+mmr review --base main --sync --format json
+```
+
+After the CLI review completes, dispatch the agent's code-reviewer skill (4th channel) and inject findings into the MMR job for unified reconciliation:
+
+```bash
+# job_id is captured from mmr review --sync --format json output
+# Write agent findings to a temp file for mmr reconcile
+echo "$AGENT_FINDINGS" > /tmp/agent-findings.json
+mmr reconcile "$JOB_ID" --channel superpowers --input /tmp/agent-findings.json
+```
+
+The agent's review output must use MMR-compatible finding schema: each finding needs `severity` (P0-P3), `location` (file:line), and `description` (`suggestion` is optional).
+
+If `mmr` is not installed (`command -v mmr` fails), fall back to the manual multi-channel flow below.
+
 ### Step 1: Detect Mode

 Parse `$ARGUMENTS` and set:
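The primary flow added in the hunk above (check for `mmr`, run `mmr review`, then `mmr reconcile`) can be sketched as a few small helpers. The flags `--staged`, `--base`, `--sync`, `--format json`, `--channel`, and `--input` are taken from the bash snippets in the diff. The helper names are hypothetical, and the assumption that the JSON output has a top-level `job_id` key rests only on the comment in the second snippet, not on any MMR documentation.

```python
import json
import shutil


def choose_flow(which=shutil.which) -> str:
    """Pick the MMR flow when `mmr` is on PATH, else the manual multi-channel flow."""
    return "mmr" if which("mmr") else "manual"


def build_review_command(base=None) -> list[str]:
    """Review staged changes by default, or a branch diff when `base` is given."""
    scope = ["--base", base] if base else ["--staged"]
    return ["mmr", "review", *scope, "--sync", "--format", "json"]


def build_reconcile_command(job_id: str, findings_path: str) -> list[str]:
    """Inject the agent's findings file into an existing MMR job."""
    return ["mmr", "reconcile", job_id, "--channel", "superpowers", "--input", findings_path]


def extract_job_id(review_stdout: str) -> str:
    # Assumption: the JSON output exposes the job id under a top-level "job_id" key.
    return json.loads(review_stdout)["job_id"]


print(build_review_command("main"))
```

Building the argument lists separately from running them keeps the sketch testable without the CLI installed; the lists can be passed to `subprocess.run` as they are.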
@@ -173,7 +198,7 @@ codex login status 2>/dev/null
 - If `codex` is not installed: skip this channel and record root-cause `not_installed`
 - If auth fails: tell the user to run `! codex login`, retry after recovery, and if recovery is not possible, record root-cause `auth_failed` and continue with the remaining channels

-If auth cannot be recovered, or if Codex is not installed, queue a compensating Claude self-review pass focused on implementation correctness, security, and API contracts. Label findings as `[compensating: Codex-equivalent]`. If auth check times out (~5s), retry once; if still failing, record `
+If auth cannot be recovered, or if Codex is not installed, queue a compensating Claude self-review pass focused on implementation correctness, security, and API contracts. Label findings as `[compensating: Codex-equivalent]`. If auth check times out (~5s), retry once; if still failing, record `timeout` and queue compensating pass. This pass runs after all channel dispatch attempts complete.

 Build the prompt in a temporary file and pass it over stdin:

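The timeout rule in the hunk above (auth check times out after about 5 seconds, retry once, then record `timeout`) could be expressed roughly as follows. The command `codex login status` appears in the hunk header. Treating exit code 0 as "authenticated" is an assumption, and the function name `check_codex_auth` and its injectable `run` parameter are hypothetical. The interactive recovery step, where the user is asked to run `codex login`, is left out of the sketch.

```python
import subprocess


def check_codex_auth(run=subprocess.run, timeout_s: float = 5.0):
    """Return None when auth looks fine, otherwise a root-cause string."""
    for _attempt in range(2):  # the initial try plus one retry on timeout
        try:
            result = run(["codex", "login", "status"], capture_output=True, timeout=timeout_s)
        except FileNotFoundError:
            return "not_installed"
        except subprocess.TimeoutExpired:
            continue  # timed out: loop again for the single retry
        # Assumption: a zero exit status means the CLI is authenticated.
        return None if result.returncode == 0 else "auth_failed"
    return "timeout"
```

A caller would queue the compensating pass whenever the return value is not `None`, and run it after all channel dispatch attempts complete, as the diff text states.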
@@ -209,9 +234,9 @@ NO_BROWSER=true gemini -p "$(cat "$PROMPT_FILE")" --output-format json --approva

 If the CLI exits with a non-zero code, produces malformed/unparseable output, or is killed by the tool runner timeout, record root-cause `failed` and queue a compensating pass for that channel.

-#### Channel 3:
+#### Channel 3: Claude CLI

-Dispatch
+Dispatch via `claude -p` with the review prompt.

 - If explicit refs are being reviewed, provide `BASE_SHA` and `HEAD_SHA`
 - Otherwise provide:
@@ -297,7 +322,7 @@ Otherwise:
 3. Repeat for up to 3 fix rounds
 4. If any finding remains unresolved after 3 rounds, stop with verdict `needs-user-decision`

-**Fix cycle channel rule:** Re-run only channels that originally completed or ran as compensating passes. Never retry a channel marked `
+**Fix cycle channel rule:** Re-run only channels that originally completed or ran as compensating passes. Never retry a channel marked `not_installed`, `auth_failed`, or `timeout` during fix rounds — its availability does not change within a session.

 ### Step 8: Final Verdict

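One possible reading of the fix cycle channel rule in the hunk above, as a sketch: a channel that completed is re-run as itself, a channel covered by a compensating pass repeats only that pass, and the unavailable external CLI is never retried. The function name `passes_to_rerun` and the plan values `"channel"` and `"compensating"` are hypothetical. Treating a `failed` channel the same as the other compensating cases is an interpretation, not something the diff states outright.

```python
def passes_to_rerun(channel_status: dict) -> dict:
    """Decide what to repeat for each channel in a fix round."""
    plan = {}
    for name, status in channel_status.items():
        if status["root_cause"] is None and status["coverage_status"] == "full":
            plan[name] = "channel"  # completed originally, so re-run the real channel
        elif status["coverage_status"] == "compensating":
            plan[name] = "compensating"  # repeat the stand-in pass, not the CLI
    return plan


first_round = {
    "codex": {"root_cause": None, "coverage_status": "full"},
    "gemini": {"root_cause": "auth_failed", "coverage_status": "compensating"},
    "superpowers": {"root_cause": None, "coverage_status": "full"},
}
print(passes_to_rerun(first_round))
```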
@@ -321,9 +346,9 @@ Output a concise summary in this format:
 [scope label]

 ### Channels Executed
-- Codex CLI — root cause: [completed /
-- Gemini CLI — root cause: [completed /
--
+- Codex CLI — root cause: [completed / not_installed / auth_failed / timeout / failed], coverage: [full / compensating (Codex-equivalent)]
+- Gemini CLI — root cause: [completed / not_installed / auth_failed / timeout / failed], coverage: [full / compensating (Gemini-equivalent)]
+- Claude CLI — root cause: [completed / not_installed / auth_failed / timeout / failed], coverage: [full / compensating]

 ### Findings
 [consensus findings first, then single-source findings]