get-research-done 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +560 -0
- package/agents/grd-architect.md +789 -0
- package/agents/grd-codebase-mapper.md +738 -0
- package/agents/grd-critic.md +1065 -0
- package/agents/grd-debugger.md +1203 -0
- package/agents/grd-evaluator.md +948 -0
- package/agents/grd-executor.md +784 -0
- package/agents/grd-explorer.md +2063 -0
- package/agents/grd-graduator.md +484 -0
- package/agents/grd-integration-checker.md +423 -0
- package/agents/grd-phase-researcher.md +641 -0
- package/agents/grd-plan-checker.md +745 -0
- package/agents/grd-planner.md +1386 -0
- package/agents/grd-project-researcher.md +865 -0
- package/agents/grd-research-synthesizer.md +256 -0
- package/agents/grd-researcher.md +2361 -0
- package/agents/grd-roadmapper.md +605 -0
- package/agents/grd-verifier.md +778 -0
- package/bin/install.js +1294 -0
- package/commands/grd/add-phase.md +207 -0
- package/commands/grd/add-todo.md +193 -0
- package/commands/grd/architect.md +283 -0
- package/commands/grd/audit-milestone.md +277 -0
- package/commands/grd/check-todos.md +228 -0
- package/commands/grd/complete-milestone.md +136 -0
- package/commands/grd/debug.md +169 -0
- package/commands/grd/discuss-phase.md +86 -0
- package/commands/grd/evaluate.md +1095 -0
- package/commands/grd/execute-phase.md +339 -0
- package/commands/grd/explore.md +258 -0
- package/commands/grd/graduate.md +323 -0
- package/commands/grd/help.md +482 -0
- package/commands/grd/insert-phase.md +227 -0
- package/commands/grd/insights.md +231 -0
- package/commands/grd/join-discord.md +18 -0
- package/commands/grd/list-phase-assumptions.md +50 -0
- package/commands/grd/map-codebase.md +71 -0
- package/commands/grd/new-milestone.md +721 -0
- package/commands/grd/new-project.md +1008 -0
- package/commands/grd/pause-work.md +134 -0
- package/commands/grd/plan-milestone-gaps.md +295 -0
- package/commands/grd/plan-phase.md +525 -0
- package/commands/grd/progress.md +364 -0
- package/commands/grd/quick-explore.md +236 -0
- package/commands/grd/quick.md +309 -0
- package/commands/grd/remove-phase.md +349 -0
- package/commands/grd/research-phase.md +200 -0
- package/commands/grd/research.md +681 -0
- package/commands/grd/resume-work.md +40 -0
- package/commands/grd/set-profile.md +106 -0
- package/commands/grd/settings.md +136 -0
- package/commands/grd/update.md +172 -0
- package/commands/grd/verify-work.md +219 -0
- package/get-research-done/config/default.json +15 -0
- package/get-research-done/references/checkpoints.md +1078 -0
- package/get-research-done/references/continuation-format.md +249 -0
- package/get-research-done/references/git-integration.md +254 -0
- package/get-research-done/references/model-profiles.md +73 -0
- package/get-research-done/references/planning-config.md +94 -0
- package/get-research-done/references/questioning.md +141 -0
- package/get-research-done/references/tdd.md +263 -0
- package/get-research-done/references/ui-brand.md +160 -0
- package/get-research-done/references/verification-patterns.md +612 -0
- package/get-research-done/templates/DEBUG.md +159 -0
- package/get-research-done/templates/UAT.md +247 -0
- package/get-research-done/templates/archive-reason.md +195 -0
- package/get-research-done/templates/codebase/architecture.md +255 -0
- package/get-research-done/templates/codebase/concerns.md +310 -0
- package/get-research-done/templates/codebase/conventions.md +307 -0
- package/get-research-done/templates/codebase/integrations.md +280 -0
- package/get-research-done/templates/codebase/stack.md +186 -0
- package/get-research-done/templates/codebase/structure.md +285 -0
- package/get-research-done/templates/codebase/testing.md +480 -0
- package/get-research-done/templates/config.json +35 -0
- package/get-research-done/templates/context.md +283 -0
- package/get-research-done/templates/continue-here.md +78 -0
- package/get-research-done/templates/critic-log.md +288 -0
- package/get-research-done/templates/data-report.md +173 -0
- package/get-research-done/templates/debug-subagent-prompt.md +91 -0
- package/get-research-done/templates/decision-log.md +58 -0
- package/get-research-done/templates/decision.md +138 -0
- package/get-research-done/templates/discovery.md +146 -0
- package/get-research-done/templates/experiment-readme.md +104 -0
- package/get-research-done/templates/graduated-script.md +180 -0
- package/get-research-done/templates/iteration-summary.md +234 -0
- package/get-research-done/templates/milestone-archive.md +123 -0
- package/get-research-done/templates/milestone.md +115 -0
- package/get-research-done/templates/objective.md +271 -0
- package/get-research-done/templates/phase-prompt.md +567 -0
- package/get-research-done/templates/planner-subagent-prompt.md +117 -0
- package/get-research-done/templates/project.md +184 -0
- package/get-research-done/templates/requirements.md +231 -0
- package/get-research-done/templates/research-project/ARCHITECTURE.md +204 -0
- package/get-research-done/templates/research-project/FEATURES.md +147 -0
- package/get-research-done/templates/research-project/PITFALLS.md +200 -0
- package/get-research-done/templates/research-project/STACK.md +120 -0
- package/get-research-done/templates/research-project/SUMMARY.md +170 -0
- package/get-research-done/templates/research.md +529 -0
- package/get-research-done/templates/roadmap.md +202 -0
- package/get-research-done/templates/scorecard.json +113 -0
- package/get-research-done/templates/state.md +287 -0
- package/get-research-done/templates/summary.md +246 -0
- package/get-research-done/templates/user-setup.md +311 -0
- package/get-research-done/templates/verification-report.md +322 -0
- package/get-research-done/workflows/complete-milestone.md +756 -0
- package/get-research-done/workflows/diagnose-issues.md +231 -0
- package/get-research-done/workflows/discovery-phase.md +289 -0
- package/get-research-done/workflows/discuss-phase.md +433 -0
- package/get-research-done/workflows/execute-phase.md +657 -0
- package/get-research-done/workflows/execute-plan.md +1844 -0
- package/get-research-done/workflows/list-phase-assumptions.md +178 -0
- package/get-research-done/workflows/map-codebase.md +322 -0
- package/get-research-done/workflows/resume-project.md +307 -0
- package/get-research-done/workflows/transition.md +556 -0
- package/get-research-done/workflows/verify-phase.md +628 -0
- package/get-research-done/workflows/verify-work.md +596 -0
- package/hooks/dist/grd-check-update.js +61 -0
- package/hooks/dist/grd-statusline.js +84 -0
- package/package.json +47 -0
- package/scripts/audit-help-commands.sh +115 -0
- package/scripts/build-hooks.js +42 -0
- package/scripts/verify-all-commands.sh +246 -0
- package/scripts/verify-architect-warning.sh +35 -0
- package/scripts/verify-insights-mode.sh +40 -0
- package/scripts/verify-quick-mode.sh +20 -0
- package/scripts/verify-revise-data-routing.sh +139 -0

@@ -0,0 +1,180 @@

# Graduated Script Template

This template is used when graduating an exploration notebook to a validated Python script via `/grd:graduate`.

## Template Variables

| Variable | Source | Description |
|----------|--------|-------------|
| `{{experiment_name}}` | OBJECTIVE.md or notebook filename | Human-readable experiment name |
| `{{source_notebook}}` | User input | Path to source notebook (e.g., `notebooks/exploration/001_initial.ipynb`) |
| `{{source_run}}` | Run directory | Run that achieved PROCEED verdict (e.g., `experiments/run_003_baseline`) |
| `{{critic_verdict}}` | CRITIC.md | Critic verdict (always PROCEED for graduation) |
| `{{verdict_date}}` | CRITIC.md | ISO 8601 date of verdict |
| `{{graduation_timestamp}}` | System | ISO 8601 timestamp when graduated |
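
The `{{...}}` placeholders above can be filled with a plain regular-expression substitution. The following is a minimal sketch, not the `/grd:graduate` implementation. `fill_template` is a hypothetical helper, and it refuses to return a header that still contains unfilled variables.

```python
import re

# Matches {{identifier}} placeholders such as {{experiment_name}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{name}} placeholder with values[name].

    Raises KeyError listing any placeholder that has no value, so a
    half-filled header never reaches the generated script.
    """
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(values))
    if missing:
        raise KeyError(f"unfilled template variables: {missing}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


# Illustrative values only.
header = (
    "Validated experiment: {{experiment_name}}\n"
    "Critic verdict: {{critic_verdict}} ({{verdict_date}})\n"
)
print(fill_template(header, {
    "experiment_name": "Baseline churn model",
    "critic_verdict": "PROCEED",
    "verdict_date": "2026-01-30",
}))
```

Raising on missing values is a deliberate choice: a stray `{{verdict_date}}` inside the module docstring is still valid Python, so the mistake would otherwise go unnoticed.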

## Python Script Template

```python
"""
Validated experiment: {{experiment_name}}

Source notebook: {{source_notebook}}
Source run: {{source_run}}
Critic verdict: {{critic_verdict}} ({{verdict_date}})
Graduated: {{graduation_timestamp}}

MANUAL REFACTORING REQUIRED:
- [ ] Remove/convert magic commands (grep -nE "^%|get_ipython")
- [ ] Extract code into functions
- [ ] Replace parameter cell with argparse
- [ ] Add docstrings and type hints
- [ ] Set all random seeds explicitly
- [ ] Write tests for core functions

This script was auto-generated from a validated notebook.
Review and complete the refactoring checklist above before production use.
"""

import argparse
import random
from typing import Any

import numpy as np

# Uncomment if using PyTorch:
# import torch

# Uncomment if using TensorFlow:
# import tensorflow as tf


def set_random_seeds(seed: int = 42) -> None:
    """Set all random seeds for reproducibility.

    Args:
        seed: Random seed value (default: 42)
    """
    random.seed(seed)
    np.random.seed(seed)

    # Uncomment for PyTorch:
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
    # torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.benchmark = False

    # Uncomment for TensorFlow:
    # tf.random.set_seed(seed)


def main(args: argparse.Namespace) -> dict[str, Any]:
    """Main experiment entry point.

    Args:
        args: Parsed command line arguments

    Returns:
        Dictionary containing experiment results/metrics
    """
    set_random_seeds(args.random_seed)

    # ---------------------------------------------------------------------
    # TODO: Implement experiment logic here
    # Extracted from: {{source_notebook}}
    #
    # Refactoring guidance:
    # 1. Convert notebook cells to functions
    # 2. Add type hints to all function signatures
    # 3. Replace hardcoded values with args.* parameters
    # 4. Return metrics as a dictionary for logging
    # ---------------------------------------------------------------------

    results = {
        "status": "not_implemented",
        "message": "Replace this with actual experiment implementation"
    }

    return results


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="{{experiment_name}}",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    # Standard arguments
    parser.add_argument(
        "--random-seed",
        type=int,
        default=42,
        help="Random seed for reproducibility"
    )

    # TODO: Add experiment-specific arguments
    # Example:
    # parser.add_argument("--learning-rate", type=float, default=0.001)
    # parser.add_argument("--epochs", type=int, default=100)
    # parser.add_argument("--batch-size", type=int, default=32)
    # parser.add_argument("--data-path", type=str, required=True)

    args = parser.parse_args()

    results = main(args)
    print(f"Experiment complete: {results}")
```

## Usage

The graduation workflow:

1. **Notebook achieves PROCEED verdict** via `/grd:research`
2. **User initiates graduation** via `/grd:graduate`
3. **System converts notebook to script** using nbconvert
4. **System prepends this header template** with filled variables
5. **Script lands in** `src/experiments/{{sanitized_experiment_name}}.py`
6. **User completes refactoring checklist** before production use
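
Steps 3 to 5 can be approximated with the standard library alone. This sketch is an assumption-laden stand-in, not the graduation command: it reads the nbformat v4 JSON directly instead of calling nbconvert (so magics are copied verbatim rather than rewritten), and `sanitize()` is a hypothetical slug rule, since this template does not define how `{{sanitized_experiment_name}}` is derived.

```python
import json
import re
from pathlib import Path


def sanitize(name: str) -> str:
    """Hypothetical slug rule: lowercase, runs of non-alphanumerics -> '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def notebook_code(nb_path: Path) -> str:
    """Concatenate the code cells of an nbformat v4 notebook (stdlib only).

    Unlike nbconvert, this does NOT rewrite IPython magics, so `%` and
    `!` lines survive verbatim for the refactoring checklist to catch.
    """
    nb = json.loads(nb_path.read_text(encoding="utf-8"))
    cells = []
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            src = cell["source"]
            # nbformat stores source either as one string or as a list of lines.
            cells.append(src if isinstance(src, str) else "".join(src))
    return "\n\n".join(cells) + "\n"


def graduate(nb_path: Path, header: str, experiment_name: str,
             out_dir: Path = Path("src/experiments")) -> Path:
    """Steps 3-5: convert, prepend the filled header, write the script."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{sanitize(experiment_name)}.py"
    target.write_text(header.rstrip("\n") + "\n\n" + notebook_code(nb_path),
                      encoding="utf-8")
    return target
```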

## Refactoring Checklist Details

### Remove/convert magic commands

```bash
# Find magic commands in the generated script.
# nbconvert rewrites magics as get_ipython() calls, so search for both forms.
grep -nE '^[[:space:]]*[%!]|get_ipython\(\)' src/experiments/your_script.py
```

Common conversions:

- `%matplotlib inline` - Remove (not needed in scripts)
- `%load_ext autoreload` - Remove (not applicable)
- `%%time` - Replace with `time.perf_counter()` timing or a profiler
- `!pip install` - Move to requirements.txt
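
The grep above only lists candidate lines. A slightly richer check can attach a suggested action to each hit. In this illustrative sketch `find_magics` and `CONVERSIONS` are hypothetical names; the patterns cover raw `%`, `%%` and `!` lines as well as the `get_ipython().run_line_magic(...)`, `run_cell_magic(...)` and `system(...)` calls that current nbconvert versions emit in their place.

```python
import re

# Suggested actions, based on the "Common conversions" list above.
CONVERSIONS = {
    "matplotlib": "remove (not needed in scripts)",
    "load_ext": "remove (not applicable)",
    "time": "replace with time.perf_counter() timing or a profiler",
}

# Raw magics (%cmd, %%cmd), shell escapes (!cmd), and nbconvert's rewritten forms.
MAGIC = re.compile(
    r"^\s*(?:%%?(?P<raw>\w+)"
    r"|!(?P<shell>\S+)"
    r"|get_ipython\(\)\.run_(?:line|cell)_magic\(['\"](?P<call>\w+)"
    r"|get_ipython\(\)\.system\()"
)


def find_magics(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, stripped_line, suggested_action) for each magic."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = MAGIC.match(line)
        if not m:
            continue
        name = m.group("raw") or m.group("call")
        if name is None:  # '!cmd' or get_ipython().system(...)
            action = "shell escape: move installs to requirements.txt, review the rest"
        else:
            action = CONVERSIONS.get(name, "review manually")
        hits.append((lineno, line.strip(), action))
    return hits
```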

### Extract code into functions

- Each logical block should be a function
- Functions should have a single responsibility
- Return values instead of relying on globals
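
As an illustration (hypothetical metric code, not taken from any GRD notebook), a cell that computes an average into a global can be split into small, single-purpose functions that take their inputs explicitly and return results. The docstrings use the Google style named in the docstring checklist item.

```python
from statistics import mean

# Notebook style (before): cells mutate module-level globals.
#   scores = [0.71, 0.74, 0.69]
#   avg = sum(scores) / len(scores)
#   passed = avg >= THRESHOLD


# Script style (after): one function per logical block, explicit inputs,
# results returned instead of stored in globals.
def average_score(scores: list[float]) -> float:
    """Return the mean of the per-fold scores.

    Args:
        scores: Metric value for each fold.

    Returns:
        Arithmetic mean of ``scores``.

    Raises:
        ValueError: If ``scores`` is empty.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return mean(scores)


def meets_threshold(value: float, threshold: float) -> bool:
    """Return True if ``value`` reaches ``threshold``."""
    return value >= threshold


avg = average_score([0.71, 0.74, 0.69])
print(meets_threshold(avg, 0.70))
```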

### Replace parameter cell with argparse

- Identify cells tagged `parameters` or cells of plain variable definitions
- Convert each parameter to `parser.add_argument()`
- Take each argument's `type` and `default` from the original value
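
A parameters cell that is a flat set of assignments maps directly onto `argparse`. This sketch assumes the cell has been copied into a dict (`NOTEBOOK_PARAMETERS` and `build_parser` are hypothetical names) and derives each flag's `type` and `default` from the original value. Booleans need special handling, because `type=bool` turns any non-empty string, including `"False"`, into `True`.

```python
import argparse
from typing import Any

# Hypothetical parameters cell copied from a notebook.
NOTEBOOK_PARAMETERS: dict[str, Any] = {
    "learning_rate": 0.001,
    "epochs": 100,
    "data_path": "data/raw.csv",
    "use_augmentation": False,
}


def build_parser(params: dict[str, Any]) -> argparse.ArgumentParser:
    """Create one --flag per notebook parameter, keeping its type and default."""
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    for name, default in params.items():
        flag = "--" + name.replace("_", "-")  # learning_rate -> --learning-rate
        if isinstance(default, bool):
            # Booleans become --flag / --no-flag switches instead of type=bool.
            parser.add_argument(flag, action=argparse.BooleanOptionalAction,
                                default=default)
        else:
            # A None default would need an explicit type; not handled here.
            parser.add_argument(flag, type=type(default), default=default)
    return parser


args = build_parser(NOTEBOOK_PARAMETERS).parse_args(["--epochs", "20", "--use-augmentation"])
print(args.epochs, args.learning_rate, args.use_augmentation)
```

`argparse.BooleanOptionalAction` requires Python 3.9 or later, which matches the `dict[str, Any]` annotations already used in the script template.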

### Add docstrings and type hints

- Every function needs a docstring
- Use Google or NumPy docstring style consistently
- Add type hints to all arguments and return values

### Set all random seeds explicitly

- Call `set_random_seeds()` at the start of `main()`
- Pass the seed to every library call that accepts one
- Document any randomness that cannot be seeded
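
One way to pass the seed through is to give each randomized function its own explicitly seeded generator instead of relying on global state. In this stdlib-only sketch, `split_indices` and `check_reproducible` are hypothetical helpers; NumPy code can follow the same pattern with `np.random.default_rng(seed)`.

```python
import random
from typing import Any, Callable


def split_indices(n: int, test_fraction: float, seed: int) -> tuple[list[int], list[int]]:
    """Shuffle 0..n-1 with a local, explicitly seeded generator and split it.

    Using random.Random(seed) instead of the global random module keeps the
    result reproducible even if other code draws from the global state.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def check_reproducible(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> bool:
    """Smoke test: run fn twice with identical arguments and compare results."""
    return fn(*args, **kwargs) == fn(*args, **kwargs)


train, test = split_indices(10, 0.2, seed=42)
print(len(train), len(test), check_reproducible(split_indices, 10, 0.2, seed=42))
```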

### Write tests for core functions

- Create `tests/test_{{sanitized_experiment_name}}.py`
- Test pure functions with known inputs/outputs
- Test edge cases and error conditions
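
A test module for the extracted functions can use the standard-library `unittest` framework, which pytest also collects. The `normalize` function here is a hypothetical stand-in for a real core function; in a project it would be imported from the graduated script rather than defined in the test file.

```python
import unittest


# Hypothetical core function; a real test module would import it instead,
# e.g. from the script in src/experiments/.
def normalize(values: list[float]) -> list[float]:
    """Scale values linearly to the range [0, 1]."""
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class TestNormalize(unittest.TestCase):
    def test_known_input_output(self) -> None:
        self.assertEqual(normalize([2.0, 4.0, 6.0]), [0.0, 0.5, 1.0])

    def test_constant_input_edge_case(self) -> None:
        self.assertEqual(normalize([3.0, 3.0]), [0.0, 0.0])

    def test_empty_input_raises(self) -> None:
        with self.assertRaises(ValueError):
            normalize([])
```

Run it with `python -m unittest discover tests` or `pytest tests/`.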

---

*Template version: 1.0*
*Phase: 06-notebook-support*

@@ -0,0 +1,234 @@

# Iteration Summary Template

Template for `experiments/archive/YYYY-MM-DD_hypothesis_name/ITERATION_SUMMARY.md`.

Collapses all iteration attempts into a single summary document when archiving.

---

## File Template

```markdown
# Iteration Summary: {{hypothesis_name}}

**Total Iterations:** {{N}}
**Date Range:** {{first_run_date}} to {{last_run_date}}
**Outcome:** Archived (hypothesis abandoned)

## Iteration History

| # | Run | Date | Verdict | Confidence | Key Metric | Notes |
|---|-----|------|---------|------------|------------|-------|
| 1 | run_001_baseline | YYYY-MM-DD | REVISE_METHOD | MEDIUM | F1=0.72 | Initial attempt |
| 2 | run_002_tuned | YYYY-MM-DD | REVISE_METHOD | MEDIUM | F1=0.76 | Hyperparameter tuning |
| 3 | run_003_final | YYYY-MM-DD | ESCALATE | LOW | F1=0.78 | Limit reached |

## Metric Trend

**Best achieved:** {{metric}}={{best_value}}
**Target:** {{threshold}}
**Gap:** {{best_value - threshold}}
**Trend:** {{improving|stagnant|degrading}}

## Verdict Distribution

- PROCEED: {{count}}
- REVISE_METHOD: {{count}}
- REVISE_DATA: {{count}}
- ESCALATE: {{count}}

## Key Observations

{{Summary of what was tried and why it didn't work}}

## Preserved Artifacts

- Final run: {{run_NNN_description}}/
- All CRITIC_LOG.md files (merged or individual)
- Final SCORECARD.json

---

*Summary generated on archive. See ARCHIVE_REASON.md for human rationale.*
```

---

## Usage Notes

**Field descriptions:**

- **hypothesis_name:** Human-readable name from the OBJECTIVE.md "what" section
- **N:** Total iteration count (number of runs attempted)
- **first_run_date:** Timestamp from the earliest run directory
- **last_run_date:** Timestamp from the final run directory
- **Outcome:** Always "Archived (hypothesis abandoned)" for this template

**Iteration History table:**

Populate from all run directories in experiments/:
- **#:** Sequential iteration number (1, 2, 3...)
- **Run:** Run directory name (run_001_baseline, run_002_tuned, etc.)
- **Date:** Date taken from CRITIC_LOG.md or the run directory timestamp
- **Verdict:** Critic verdict (PROCEED, REVISE_METHOD, REVISE_DATA, ESCALATE)
- **Confidence:** Critic confidence level (HIGH, MEDIUM, LOW)
- **Key Metric:** Primary metric value from SCORECARD.json (e.g., "F1=0.72")
- **Notes:** Brief description of what was tried (from run README.md or CRITIC_LOG.md)
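
When the summary is generated from Python rather than shell, one table row can be assembled as below. The sketch assumes the `**Verdict:**` / `**Confidence:**` line format and the `metrics[0]` SCORECARD.json layout that the shell commands in this template already rely on; `history_row` and `first_field` are hypothetical helpers.

```python
import json
import re


def first_field(markdown: str, field: str) -> str:
    """Return the first word after a '**Field:**' line, or '?' if absent."""
    pattern = rf"^\*\*{re.escape(field)}:\*\*\s*(\S+)"
    m = re.search(pattern, markdown, flags=re.MULTILINE)
    return m.group(1) if m else "?"


def history_row(i: int, run: str, date: str, critic_log: str,
                scorecard_json: str, notes: str) -> str:
    """Format one Iteration History row.

    critic_log is the text of CRITIC_LOG.md, scorecard_json the text of
    SCORECARD.json; the primary metric is assumed to be metrics[0].
    """
    metric = json.loads(scorecard_json)["metrics"][0]
    cells = [
        str(i), run, date,
        first_field(critic_log, "Verdict"),
        first_field(critic_log, "Confidence"),
        f"{metric['name']}={metric['value']}",
        notes,
    ]
    return "| " + " | ".join(cells) + " |"
```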

**Metric Trend section:**

- **Best achieved:** Highest value attained across all iterations for the primary metric
- **Target:** Threshold from OBJECTIVE.md for that metric
- **Gap:** Difference (best_value - threshold), shown with its sign
- **Trend:** Classify based on metric progression:
  - "improving" if values consistently increase across iterations
  - "stagnant" if values plateau or fluctuate without progress
  - "degrading" if values decrease (rare but possible with overfitting)
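
The Trend labels above are qualitative, so any implementation has to pick a concrete rule. The sketch below uses one possible rule (an assumption, not part of the template): every step must move by more than a small tolerance to count as improving or degrading. It also formats Gap with an explicit sign. Both functions are hypothetical helpers.

```python
def classify_trend(values: list[float], tol: float = 0.005) -> str:
    """Classify a per-iteration metric series.

    Assumed rule: "improving" if every step rises by more than tol,
    "degrading" if every step falls by more than tol, else "stagnant".
    tol absorbs small run-to-run noise.
    """
    steps = [b - a for a, b in zip(values, values[1:])]
    if steps and all(s > tol for s in steps):
        return "improving"
    if steps and all(s < -tol for s in steps):
        return "degrading"
    return "stagnant"


def format_gap(best_value: float, threshold: float) -> str:
    """Gap = best_value - threshold, always printed with its sign."""
    return f"{best_value - threshold:+.2f}"


f1_by_run = [0.52, 0.56, 0.58, 0.57, 0.58]
print(classify_trend(f1_by_run), format_gap(max(f1_by_run), 0.85))
```

With the F1 series from the example populated template, the dip at run 4 makes the series "stagnant" and the gap prints as `-0.27`, matching that example.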

**Verdict Distribution:**

Count occurrences of each verdict type across all runs:
```bash
# Example calculation
PROCEED=$(grep -h "^\*\*Verdict:\*\* PROCEED" experiments/run_*/CRITIC_LOG.md | wc -l)
REVISE_METHOD=$(grep -h "^\*\*Verdict:\*\* REVISE_METHOD" experiments/run_*/CRITIC_LOG.md | wc -l)
REVISE_DATA=$(grep -h "^\*\*Verdict:\*\* REVISE_DATA" experiments/run_*/CRITIC_LOG.md | wc -l)
ESCALATE=$(grep -h "^\*\*Verdict:\*\* ESCALATE" experiments/run_*/CRITIC_LOG.md | wc -l)
```
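
The four `grep | wc -l` commands can be collapsed into a single pass over the logs. A minimal sketch, assuming the same `**Verdict:** X` line format; `verdict_distribution` is a hypothetical helper. Like the grep commands, it counts every verdict line in each log, and it always reports all four verdict types, including those with a count of zero.

```python
import re
from collections import Counter
from pathlib import Path

VERDICTS = ("PROCEED", "REVISE_METHOD", "REVISE_DATA", "ESCALATE")
VERDICT_LINE = re.compile(r"^\*\*Verdict:\*\*\s*(\w+)", re.MULTILINE)


def verdict_distribution(experiments: Path = Path("experiments")) -> Counter:
    """Count '**Verdict:** X' lines across experiments/run_*/CRITIC_LOG.md."""
    counts = Counter({v: 0 for v in VERDICTS})  # report zero counts explicitly
    for log in sorted(experiments.glob("run_*/CRITIC_LOG.md")):
        counts.update(VERDICT_LINE.findall(log.read_text(encoding="utf-8")))
    return counts
```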

**Key Observations section:**

Synthesize learnings from all iterations. Extract from:
- CRITIC_LOG.md strengths/weaknesses across runs
- Patterns in metric progression
- What approaches were attempted (architectures, hyperparameters, data transformations)
- Why none succeeded (common failure mode)

Examples:
- "All iterations struggled with class imbalance (99.2% negative). Resampling techniques (SMOTE, undersampling) did not improve recall."
- "Metric progression stagnated after run 2. Runs 3-5 showed no improvement despite hyperparameter changes, suggesting an architectural limitation."
- "Data leakage warnings in DATA_REPORT.md were confirmed. Removing leaked features in runs 4-5 caused a significant performance drop, revealing that the hypothesis depended on an invalid signal."

**Preserved Artifacts section:**

List what's kept in the archive:
- Final run directory (moved from experiments/)
- CRITIC_LOG.md files (individual or merged)
- Final SCORECARD.json
- ARCHIVE_REASON.md (user rationale)
- This ITERATION_SUMMARY.md

**Example populated template:**

```markdown
# Iteration Summary: Ensemble Methods for Fraud Detection

**Total Iterations:** 5
**Date Range:** 2026-01-15 to 2026-01-30
**Outcome:** Archived (hypothesis abandoned)

## Iteration History

| # | Run | Date | Verdict | Confidence | Key Metric | Notes |
|---|-----|------|---------|------------|------------|-------|
| 1 | run_001_baseline | 2026-01-15 | REVISE_METHOD | MEDIUM | F1=0.52 | Initial ensemble (RF+GBM) |
| 2 | run_002_smote | 2026-01-18 | REVISE_METHOD | MEDIUM | F1=0.56 | Added SMOTE resampling |
| 3 | run_003_weighted | 2026-01-22 | REVISE_METHOD | MEDIUM | F1=0.58 | Class weight optimization |
| 4 | run_004_reduced | 2026-01-25 | REVISE_METHOD | LOW | F1=0.57 | Feature reduction (237→15) |
| 5 | run_005_final | 2026-01-30 | ESCALATE | LOW | F1=0.58 | Limit reached |

## Metric Trend

**Best achieved:** F1=0.58
**Target:** 0.85
**Gap:** -0.27
**Trend:** stagnant (plateaued at 0.56-0.58 after run 2)

## Verdict Distribution

- PROCEED: 0
- REVISE_METHOD: 4
- REVISE_DATA: 0
- ESCALATE: 1

## Key Observations

All five iterations struggled with severe class imbalance (99.2% negative class, N=120 positive examples). Despite trying multiple approaches (SMOTE resampling, class weighting, feature reduction), the F1 score plateaued at 0.56-0.58, far below the target of 0.85.

**What was tried:**
- Run 1: Baseline ensemble (Random Forest + Gradient Boosting)
- Run 2: SMOTE oversampling to balance classes
- Run 3: Class weight optimization (weighted loss functions)
- Run 4: Feature reduction (237→15 features) to reduce overfitting
- Run 5: Combined approach with focal loss

**Common failure mode:**
All iterations achieved high precision (>0.90) but poor recall (<0.42), indicating the model learned to be conservative due to extreme class imbalance. Resampling techniques introduced artificial patterns that didn't generalize. The fundamental issue is insufficient positive examples (N=120) for the hypothesis to be testable with ensemble methods.

**Critic's repeated concerns:**
- "Limited positive examples prevent ensemble diversity"
- "High precision but poor recall suggests severe class imbalance"
- "Feature space too large relative to positive sample size"

## Preserved Artifacts

- Final run: run_005_final/ (complete snapshot with DECISION.md)
- All CRITIC_LOG.md files (archived individually in final_run/)
- Final SCORECARD.json (F1=0.58, composite=0.64)
- ARCHIVE_REASON.md (user rationale for abandonment)
- This ITERATION_SUMMARY.md

---

*Summary generated on archive. See ARCHIVE_REASON.md for human rationale.*
```

---

## Integration

This template is used by the `/grd:evaluate` command in Phase 5 (Archive Handling) when the user selects the "Archive" decision.

**Inputs:**
- All run directories in experiments/ (run_001, run_002, ...)
- CRITIC_LOG.md from each run (verdict, confidence, recommendations)
- SCORECARD.json from each run (metrics, composite score)
- OBJECTIVE.md (hypothesis, target thresholds)

**Generation logic:**

```bash
# Collect all runs
RUNS=$(ls -1d experiments/run_* | sort)
TOTAL=$(echo "$RUNS" | wc -l | tr -d ' ')

# Modification date of a path as YYYY-MM-DD (GNU stat on Linux, BSD stat on macOS)
run_date() {
  if stat --version >/dev/null 2>&1; then
    stat -c %y "$1" | cut -d' ' -f1
  else
    stat -f "%Sm" -t "%Y-%m-%d" "$1"
  fi
}

# Extract date range
FIRST_DATE=$(run_date "$(echo "$RUNS" | head -1)")
LAST_DATE=$(run_date "$(echo "$RUNS" | tail -1)")

# Build iteration history table
i=0
for run in $RUNS; do
  i=$((i + 1))
  DATE=$(run_date "$run")
  VERDICT=$(grep "^\*\*Verdict:\*\*" "$run/CRITIC_LOG.md" | head -1 | awk '{print $2}')
  CONFIDENCE=$(grep "^\*\*Confidence:\*\*" "$run/CRITIC_LOG.md" | head -1 | awk '{print $2}')
  METRIC=$(jq -r '.metrics[0] | "\(.name)=\(.value)"' "$run/metrics/SCORECARD.json")
  NOTES=$(head -1 "$run/README.md" | sed 's/^# //')
  echo "| $i | $(basename "$run") | $DATE | $VERDICT | $CONFIDENCE | $METRIC | $NOTES |"
done

# Calculate metric trend
BEST_METRIC=$(jq -s 'map(.metrics[0].value) | max' experiments/run_*/metrics/SCORECARD.json)
# OBJECTIVE.md is Markdown, so jq cannot parse it. Assumption: the first line
# mentioning "threshold" holds the primary metric's target as a decimal number.
TARGET=$(grep -i -m1 "threshold" .planning/OBJECTIVE.md | grep -oE '[0-9]*\.[0-9]+' | head -1)
# Signed gap, e.g. -0.27 (awk handles decimal arithmetic portably)
GAP=$(awk -v best="$BEST_METRIC" -v target="$TARGET" 'BEGIN { printf "%+.2f", best - target }')

# Determine trend (simplified)
# [ -lt ] only compares integers, so let awk compare the decimal gap
if awk -v gap="$GAP" 'BEGIN { exit !(gap < -0.10) }'; then
  TREND="stagnant (far from target)"
else
  TREND="improving (but insufficient)"
fi
```

**Output:**
- experiments/archive/YYYY-MM-DD_hypothesis_slug/ITERATION_SUMMARY.md
- Referenced by ARCHIVE_REASON.md in the same directory
- Provides historical context for the negative result
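
The output path can be built deterministically. This sketch assumes the date prefix is the archive date and uses a hypothetical slug rule (lowercase words joined by underscores); the template itself defines neither. `hypothesis_slug` and `summary_path` are hypothetical helpers.

```python
import re
from datetime import date
from pathlib import Path


def hypothesis_slug(name: str) -> str:
    """Hypothetical slug rule: lowercase alphanumeric words joined by '_'."""
    return "_".join(re.findall(r"[a-z0-9]+", name.lower()))


def summary_path(archived_on: date, hypothesis_name: str,
                 root: Path = Path("experiments/archive")) -> Path:
    """Build experiments/archive/YYYY-MM-DD_hypothesis_slug/ITERATION_SUMMARY.md."""
    folder = f"{archived_on.isoformat()}_{hypothesis_slug(hypothesis_name)}"
    return root / folder / "ITERATION_SUMMARY.md"


print(summary_path(date(2026, 1, 30), "Ensemble Methods for Fraud Detection"))
```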

@@ -0,0 +1,123 @@

# Milestone Archive Template

This template is used by the complete-milestone workflow to create archive files in `.planning/milestones/`.

---

## File Template

# Milestone v{{VERSION}}: {{MILESTONE_NAME}}

**Status:** ✅ SHIPPED {{DATE}}
**Phases:** {{PHASE_START}}-{{PHASE_END}}
**Total Plans:** {{TOTAL_PLANS}}

## Overview

{{MILESTONE_DESCRIPTION}}

## Phases

{{PHASES_SECTION}}

[For each phase in this milestone, include:]

### Phase {{PHASE_NUM}}: {{PHASE_NAME}}

**Goal**: {{PHASE_GOAL}}
**Depends on**: {{DEPENDS_ON}}
**Plans**: {{PLAN_COUNT}} plans

Plans:

- [x] {{PHASE}}-01: {{PLAN_DESCRIPTION}}
- [x] {{PHASE}}-02: {{PLAN_DESCRIPTION}}
[... all plans ...]

**Details:**
{{PHASE_DETAILS_FROM_ROADMAP}}

**For decimal phases, include (INSERTED) marker:**

### Phase 2.1: Critical Security Patch (INSERTED)

**Goal**: Fix authentication bypass vulnerability
**Depends on**: Phase 2
**Plans**: 1 plan

Plans:

- [x] 02.1-01: Patch auth vulnerability

**Details:**
{{PHASE_DETAILS_FROM_ROADMAP}}

---

## Milestone Summary

**Decimal Phases:**

- Phase 2.1: Critical Security Patch (inserted after Phase 2 for urgent fix)
- Phase 5.1: Performance Hotfix (inserted after Phase 5 for production issue)

**Key Decisions:**
{{DECISIONS_FROM_PROJECT_STATE}}
[Example:]

- Decision: Use ROADMAP.md split (Rationale: Constant context cost)
- Decision: Decimal phase numbering (Rationale: Clear insertion semantics)

**Issues Resolved:**
{{ISSUES_RESOLVED_DURING_MILESTONE}}
[Example:]

- Fixed context overflow at 100+ phases
- Resolved phase insertion confusion

**Issues Deferred:**
{{ISSUES_DEFERRED_TO_LATER}}
[Example:]

- PROJECT-STATE.md tiering (deferred until decisions > 300)

**Technical Debt Incurred:**
{{SHORTCUTS_NEEDING_FUTURE_WORK}}
[Example:]

- Some workflows still have hardcoded paths (fix in Phase 5)

---

_For current project status, see .planning/ROADMAP.md_

---

## Usage Guidelines

<guidelines>
**When to create milestone archives:**

- After completing all phases in a milestone (v1.0, v1.1, v2.0, etc.)
- Triggered by complete-milestone workflow
- Before planning next milestone work

**How to fill template:**

- Replace {{PLACEHOLDERS}} with actual values
- Extract phase details from ROADMAP.md
- Document decimal phases with (INSERTED) marker
- Include key decisions from PROJECT-STATE.md or SUMMARY files
- List issues resolved vs deferred
- Capture technical debt for future reference

**Archive location:**

- Save to `.planning/milestones/v{VERSION}-{NAME}.md`
- Example: `.planning/milestones/v1.0-mvp.md`

**After archiving:**

- Update ROADMAP.md to collapse completed milestone in `<details>` tag
- Update PROJECT.md to brownfield format with Current State section
- Continue phase numbering in next milestone (never restart at 01)
</guidelines>
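
Two mechanical parts of these guidelines, the archive filename and the position of decimal (INSERTED) phases, can be sketched as below. Both helpers are hypothetical: the filename slug rule is inferred from the single `v1.0-mvp.md` example, and the sort key assumes phase numbers are dot-separated integers such as `2`, `02.1` or `5.1`.

```python
import re


def archive_path(version: str, name: str) -> str:
    """Build .planning/milestones/v{VERSION}-{NAME}.md.

    Assumed slug rule (inferred from the v1.0-mvp.md example): lowercase
    alphanumeric words joined by hyphens.
    """
    slug = "-".join(re.findall(r"[a-z0-9]+", name.lower()))
    return f".planning/milestones/v{version}-{slug}.md"


def phase_key(phase: str) -> tuple[int, ...]:
    """Sort key that keeps decimal (INSERTED) phases right after their parent.

    "2" -> (2,), "02.1" -> (2, 1), "2.10" -> (2, 10), so 2 < 2.1 < 2.2 < 2.10 < 3.
    """
    return tuple(int(part) for part in phase.split("."))


print(archive_path("1.0", "MVP"))
print(sorted(["5", "2.1", "10", "2", "5.1", "3"], key=phase_key))
```

Comparing integer tuples rather than floats keeps `2.10` (a tenth insertion) after `2.2`, which a float sort would get wrong.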

@@ -0,0 +1,115 @@

# Milestone Entry Template

Add this entry to `.planning/MILESTONES.md` when completing a milestone:

```markdown
## v[X.Y] [Name] (Shipped: YYYY-MM-DD)

**Delivered:** [One sentence describing what shipped]

**Phases completed:** [X-Y] ([Z] plans total)

**Key accomplishments:**
- [Major achievement 1]
- [Major achievement 2]
- [Major achievement 3]
- [Major achievement 4]

**Stats:**
- [X] files created/modified
- [Y] lines of code (primary language)
- [Z] phases, [N] plans, [M] tasks
- [D] days from start to ship (or milestone to milestone)

**Git range:** `feat(XX-XX)` → `feat(YY-YY)`

**What's next:** [Brief description of next milestone goals, or "Project complete"]

---
```

<structure>
If MILESTONES.md doesn't exist, create it with header:

```markdown
# Project Milestones: [Project Name]

[Entries in reverse chronological order - newest first]
```
</structure>
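
The create-or-prepend rule above can be sketched as follows. `add_milestone_entry` is a hypothetical helper; it assumes each entry begins with a `## v` heading, as in the template, and it writes only the `# Project Milestones:` title line, treating the bracketed ordering note as guidance rather than file content.

```python
from pathlib import Path

HEADER_TEMPLATE = "# Project Milestones: {project}\n"


def add_milestone_entry(path: Path, project: str, entry: str) -> None:
    """Insert a milestone entry at the top of MILESTONES.md (newest first).

    Creates the file with the title header if it does not exist yet.
    """
    if path.exists():
        text = path.read_text(encoding="utf-8")
    else:
        text = HEADER_TEMPLATE.format(project=project)
    cut = text.find("\n## v")  # start of the current newest entry, if any
    body = entry.strip("\n")
    if cut == -1:
        # No entries yet: append after the header.
        new_text = text.rstrip("\n") + "\n\n" + body + "\n"
    else:
        # Place the new entry directly before the current newest entry.
        new_text = text[: cut + 1] + body + "\n\n" + text[cut + 1:]
    path.write_text(new_text, encoding="utf-8")
```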

<guidelines>
**When to create milestones:**
- Initial v1.0 MVP shipped
- Major version releases (v2.0, v3.0)
- Significant feature milestones (v1.1, v1.2)
- Before archiving planning (capture what was shipped)

**Don't create milestones for:**
- Individual phase completions (normal workflow)
- Work in progress (wait until shipped)
- Minor bug fixes that don't constitute a release

**Stats to include:**
- Count modified files: `git diff --stat FIRST^ LAST | tail -1`, where FIRST and LAST are the hashes of the milestone's first and last commits (e.g., found with `git log --oneline --grep='^feat('`); a prefix such as `feat(XX-XX)` is part of a commit message, not a git revision
- Count LOC: `find . -name "*.swift" -o -name "*.ts" | xargs wc -l` (or relevant extension)
- Phase/plan/task counts from ROADMAP
- Timeline from first phase commit to last phase commit

**Git range format:**
- First commit of milestone → last commit of milestone
- Example: `feat(01-01)` → `feat(04-01)` for phases 1-4
</guidelines>
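
For projects without a POSIX shell, or with vendored code that `find` would also count, the LOC and timeline stats can be computed in Python. A sketch with hypothetical helpers; the pruned directory names in `skip_dirs` are assumptions to adjust per project.

```python
import os
from datetime import date
from pathlib import Path


def count_loc(root: Path,
              extensions: tuple[str, ...] = (".swift", ".ts"),
              skip_dirs: frozenset[str] = frozenset({".git", "node_modules", "build"})) -> int:
    """Count lines in files ending with one of `extensions` under `root`.

    Blank and comment lines are counted, roughly as `wc -l` does.
    Directories named in `skip_dirs` are pruned so vendored or generated
    code is not counted.
    """
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]  # prune in place
        for name in filenames:
            if name.endswith(extensions):
                path = Path(dirpath) / name
                with path.open(encoding="utf-8", errors="replace") as fh:
                    total += sum(1 for _ in fh)
    return total


def days_between(first: date, last: date) -> int:
    """Calendar days from the first phase commit date to the last one."""
    return (last - first).days
```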

<example>
```markdown
# Project Milestones: WeatherBar

## v1.1 Security & Polish (Shipped: 2025-12-10)

**Delivered:** Security hardening with Keychain integration and comprehensive error handling

**Phases completed:** 5-6 (3 plans total)

**Key accomplishments:**
- Migrated API key storage from plaintext to macOS Keychain
- Implemented comprehensive error handling for network failures
- Added Sentry crash reporting integration
- Fixed memory leak in auto-refresh timer

**Stats:**
- 23 files modified
- 650 lines of Swift added
- 2 phases, 3 plans, 12 tasks
- 15 days from v1.0 to v1.1

**Git range:** `feat(05-01)` → `feat(06-02)`

**What's next:** v2.0 SwiftUI redesign with widget support

---

## v1.0 MVP (Shipped: 2025-11-25)

**Delivered:** Menu bar weather app with current conditions and 3-day forecast

**Phases completed:** 1-4 (7 plans total)

**Key accomplishments:**
- Menu bar app with popover UI (AppKit)
- OpenWeather API integration with auto-refresh
- Current weather display with conditions icon
- 3-day forecast list with high/low temperatures
- Code signed and notarized for distribution

**Stats:**
- 47 files created
- 2,450 lines of Swift
- 4 phases, 7 plans, 28 tasks
- 12 days from start to ship

**Git range:** `feat(01-01)` → `feat(04-01)`

**What's next:** Security audit and hardening for v1.1
```
</example>