ctx-cc 3.5.0 → 4.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +375 -676
- package/agents/ctx-arch-mapper.md +5 -3
- package/agents/ctx-auditor.md +5 -3
- package/agents/ctx-codex-reviewer.md +214 -0
- package/agents/ctx-concerns-mapper.md +5 -3
- package/agents/ctx-criteria-suggester.md +6 -4
- package/agents/ctx-debugger.md +5 -3
- package/agents/ctx-designer.md +488 -114
- package/agents/ctx-discusser.md +5 -3
- package/agents/ctx-executor.md +5 -3
- package/agents/ctx-handoff.md +6 -4
- package/agents/ctx-learner.md +5 -3
- package/agents/ctx-mapper.md +4 -3
- package/agents/ctx-ml-analyst.md +600 -0
- package/agents/ctx-ml-engineer.md +933 -0
- package/agents/ctx-ml-reviewer.md +485 -0
- package/agents/ctx-ml-scientist.md +626 -0
- package/agents/ctx-parallelizer.md +4 -3
- package/agents/ctx-planner.md +5 -3
- package/agents/ctx-predictor.md +4 -3
- package/agents/ctx-qa.md +5 -3
- package/agents/ctx-quality-mapper.md +5 -3
- package/agents/ctx-researcher.md +5 -3
- package/agents/ctx-reviewer.md +6 -4
- package/agents/ctx-team-coordinator.md +5 -3
- package/agents/ctx-tech-mapper.md +5 -3
- package/agents/ctx-verifier.md +5 -3
- package/bin/ctx.js +199 -27
- package/commands/brand.md +309 -0
- package/commands/ctx.md +10 -10
- package/commands/design.md +304 -0
- package/commands/experiment.md +251 -0
- package/commands/help.md +57 -7
- package/commands/init.md +25 -0
- package/commands/metrics.md +1 -1
- package/commands/milestone.md +1 -1
- package/commands/ml-status.md +197 -0
- package/commands/monitor.md +1 -1
- package/commands/train.md +266 -0
- package/commands/visual-qa.md +559 -0
- package/commands/voice.md +1 -1
- package/hooks/post-tool-use.js +39 -0
- package/hooks/pre-tool-use.js +94 -0
- package/hooks/subagent-stop.js +32 -0
- package/package.json +9 -3
- package/plugin.json +46 -0
- package/skills/ctx-design-system/SKILL.md +572 -0
- package/skills/ctx-ml-experiment/SKILL.md +334 -0
- package/skills/ctx-ml-pipeline/SKILL.md +437 -0
- package/skills/ctx-orchestrator/SKILL.md +91 -0
- package/skills/ctx-review-gate/SKILL.md +147 -0
- package/skills/ctx-state/SKILL.md +100 -0
- package/skills/ctx-visual-qa/SKILL.md +587 -0
- package/src/agents.js +109 -0
- package/src/auto.js +287 -0
- package/src/capabilities.js +226 -0
- package/src/commits.js +94 -0
- package/src/config.js +112 -0
- package/src/context.js +241 -0
- package/src/handoff.js +156 -0
- package/src/hooks.js +218 -0
- package/src/install.js +125 -50
- package/src/lifecycle.js +194 -0
- package/src/metrics.js +198 -0
- package/src/pipeline.js +269 -0
- package/src/review-gate.js +338 -0
- package/src/runner.js +120 -0
- package/src/skills.js +143 -0
- package/src/state.js +267 -0
- package/src/worktree.js +244 -0
- package/templates/PRD.json +1 -1
- package/templates/config.json +4 -237
- package/workflows/ctx-router.md +0 -485
- package/workflows/map-codebase.md +0 -329
@@ -0,0 +1,334 @@
---
name: ctx-ml-experiment
description: |
  WHEN: User wants to run ML experiments, track hypotheses, compare models, analyze results, tune hyperparameters, or manage the experiment loop (hypothesize → design → implement → run → analyze → iterate).
  WHEN NOT: Software development, UI design, non-ML coding tasks, infrastructure work unrelated to model training or evaluation.
---

# CTX ML Experiment: Hypothesis-Driven Experiment Lifecycle

You manage the full ML experiment lifecycle: from hypothesis formation through result analysis and iteration. Every experiment is persisted, comparable, and traceable.

## Core Principle

No ad-hoc runs. Every model change starts with a hypothesis, proceeds through a defined design, and concludes with recorded results. The experiment log is the ground truth.

## Experiment Lifecycle

```
hypothesize → design → implement → run → analyze → iterate
     ↑                                        │
     └────────── learn from results ──────────┘
```

Never skip phases. A run without a hypothesis is noise, not science.

## Directory Structure

Bootstrap `.ctx/ml/` if it does not exist:

```
.ctx/ml/
├── experiments/
│   ├── EXP-001/
│   │   ├── HYPOTHESIS.md      # What we believe and why
│   │   ├── DESIGN.md          # How to test it
│   │   ├── config.yaml        # Reproducible run config
│   │   ├── RESULTS.md         # Actual outcomes vs expected
│   │   └── artifacts/         # Model checkpoints, plots, logs
│   └── EXPERIMENT-LOG.md      # Running table of all experiments
├── analysis/
│   ├── EDA-<dataset>.md       # Exploratory data analysis
│   └── plots/
├── features/
│   ├── feature-registry.yaml  # All features, transforms, lineage
│   └── transforms/            # Reusable transform definitions
├── models/
│   ├── registry.yaml          # Model versions, metrics, lineage
│   └── configs/               # Canonical model configs
└── ML-STATUS.md               # Current ML project status
```

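The bootstrap step can be sketched in a few lines of Python. This is only an illustration of the layout above, not the package's own implementation: the function name `bootstrap_ml_dir` and the one-line seed contents of the two markdown files are assumptions.

```python
from pathlib import Path

# Subdirectories taken from the tree above (leaf paths only; parents are created too).
ML_SUBDIRS = [
    "experiments",
    "analysis/plots",
    "features/transforms",
    "models/configs",
]


def bootstrap_ml_dir(project_root: str) -> Path:
    """Create .ctx/ml/ and its subdirectories if they are missing (hypothetical helper)."""
    ml_root = Path(project_root) / ".ctx" / "ml"
    for sub in ML_SUBDIRS:
        (ml_root / sub).mkdir(parents=True, exist_ok=True)

    # Seed the log and status files only when absent, so reruns never overwrite them.
    log = ml_root / "experiments" / "EXPERIMENT-LOG.md"
    if not log.exists():
        log.write_text("# ML Experiment Log\n", encoding="utf-8")
    status = ml_root / "ML-STATUS.md"
    if not status.exists():
        status.write_text("# ML Project Status\n", encoding="utf-8")
    return ml_root
```

Because every step checks for existence first, calling the helper on an already bootstrapped project leaves existing files untouched.
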
## Experiment ID Convention

IDs are sequential: `EXP-001`, `EXP-002`, etc. Read the current max from `EXPERIMENT-LOG.md` and increment. Never reuse IDs.

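A minimal sketch of the ID rule, assuming the log is available as a string. The helper name `next_experiment_id` is hypothetical. It scans every `EXP-NNN` mention in the text rather than only the ID column, which can only overestimate the maximum and therefore never produces a reused ID.

```python
import re


def next_experiment_id(log_text: str) -> str:
    """Return the next sequential ID after the highest EXP-NNN found in the log text."""
    ids = [int(n) for n in re.findall(r"\bEXP-(\d+)\b", log_text)]
    # An empty log has no IDs yet, so numbering starts at EXP-001.
    return f"EXP-{max(ids, default=0) + 1:03d}"
```

For a log whose last row is `EXP-003`, the helper returns `EXP-004`; for an empty log it returns `EXP-001`.
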
## HYPOTHESIS.md Format

```markdown
# EXP-{id}: {Short Hypothesis Title}

**Created**: {ISO date}
**Author**: {agent or human}
**Status**: draft | running | concluded

## Hypothesis

{One sentence: "We believe that X will improve Y by Z because of W."}

## Rationale

- {Observation or data point that motivates this}
- {Prior experiment result that informs this}
- {Domain knowledge or literature support}

## Expected Outcome

- Primary metric: {metric} improves from {baseline} to {target}
- Guard rail: {metric} does not degrade below {threshold}

## Null Hypothesis

No meaningful difference between {treatment} and {control} on {metric}.

## Risk

{What would make this hypothesis wrong? What could go wrong?}
```

## DESIGN.md Format

```markdown
# EXP-{id}: Experiment Design

## Setup

| Property | Value |
|----------|-------|
| Baseline | {model/config used as control} |
| Treatment | {what changes} |
| Dataset | {train/val/test splits, sizes} |
| Random seed | {seed value} |

## Changes from Baseline

- {File or config}: {change description}

## Metrics

| Metric | Direction | Threshold | Notes |
|--------|-----------|-----------|-------|
| {primary} | maximize | {value} | promotion gate |
| {guard} | minimize | {value} | regression gate |

## Evaluation Protocol

1. {Step 1}
2. {Step 2}
3. {Step 3}

## Acceptance Criteria

- [ ] Primary metric meets threshold
- [ ] Guard rail metrics not violated
- [ ] Training is stable (no loss explosion, no NaN)
- [ ] Inference latency within budget
```

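The Metrics table in the design template pairs each metric with a direction and a threshold. One way to evaluate such a row is sketched below. The function name `passes_gate` is hypothetical, and treating the threshold as inclusive is an assumption the template does not state.

```python
def passes_gate(value: float, direction: str, threshold: float) -> bool:
    """Check one metric row: 'maximize' needs value >= threshold, 'minimize' needs value <= threshold."""
    if direction == "maximize":
        return value >= threshold
    if direction == "minimize":
        return value <= threshold
    # Anything else is a typo in DESIGN.md; fail loudly instead of guessing.
    raise ValueError(f"unknown direction: {direction!r}")
```

For example, `passes_gate(0.87, "maximize", 0.85)` holds, while `passes_gate(0.82, "maximize", 0.85)` does not.
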
## config.yaml Format

```yaml
experiment_id: EXP-001
hypothesis: "XGBoost with depth=6 outperforms depth=4 baseline"
created: "2026-03-25"

data:
  train: data/train_v2.parquet
  val: data/val_v2.parquet
  test: data/test_v2.parquet
  seed: 42

model:
  type: xgboost
  params:
    max_depth: 6
    n_estimators: 300
    learning_rate: 0.05
    subsample: 0.8
    colsample_bytree: 0.8

evaluation:
  primary_metric: auc
  guard_metrics: [precision, recall]
  cv_folds: 5

artifacts:
  model_path: artifacts/model.pkl
  plots: artifacts/plots/
  logs: artifacts/train.log
```

## RESULTS.md Format

```markdown
# EXP-{id}: Results

**Concluded**: {ISO date}
**Status**: accepted | rejected | inconclusive

## Outcome

| Metric | Baseline | Result | Delta | Pass? |
|--------|----------|--------|-------|-------|
| {primary} | {value} | {value} | {+/-} | yes/no |
| {guard} | {value} | {value} | {+/-} | yes/no |

## Verdict

{accepted | rejected | inconclusive}: {One sentence reason}

## Key Findings

- {Finding 1}
- {Finding 2}

## What This Tells Us

{2-3 sentences on what we learned, not just the metric deltas.}

## Next Experiment

{What should EXP-{n+1} test, given these results?}
```

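The Outcome table in the results template can be filled mechanically once baseline and result values are known. The sketch below formats one row with a signed delta. The helper name `results_row` and the two-decimal formatting are assumptions chosen for illustration.

```python
def results_row(metric: str, baseline: float, result: float, passed: bool) -> str:
    """Format one Outcome row: metric, baseline, result, signed delta, pass flag."""
    delta = result - baseline
    flag = "yes" if passed else "no"
    # The '+' format flag keeps an explicit sign, matching the {+/-} column in the template.
    return f"| {metric} | {baseline:.2f} | {result:.2f} | {delta:+.2f} | {flag} |"
```

Calling `results_row("auc", 0.82, 0.87, True)` yields `| auc | 0.82 | 0.87 | +0.05 | yes |`.
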
## EXPERIMENT-LOG.md Format

```markdown
# ML Experiment Log

| ID | Hypothesis | Model | Primary Metric | Result | Status |
|----|-----------|-------|---------------|--------|--------|
| EXP-001 | XGBoost > baseline | XGBoost depth=4 | AUC 0.85 target | AUC 0.82 | rejected |
| EXP-002 | Feature X improves AUC | XGBoost+feat_x | AUC 0.87 target | AUC 0.87 | accepted |
| EXP-003 | HPO on accepted config | XGBoost+feat_x tuned | AUC 0.90 target | running | running |
```

Always append; never delete rows. Mark rejected experiments as `rejected` rather than removing them.

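The append-only rule can be enforced in code by refusing to write a row whose ID already exists. The following is a sketch that operates on the log text as a string. The function name `append_log_row` and its parameter names are hypothetical, and it covers only adding new rows, not updating the status of a running one.

```python
def append_log_row(log_text: str, exp_id: str, hypothesis: str, model: str,
                   target: str, result: str, status: str) -> str:
    """Return the log text with one new row appended; existing rows are never modified."""
    if f"| {exp_id} |" in log_text:
        # IDs are never reused, so a duplicate means the caller picked the wrong ID.
        raise ValueError(f"{exp_id} is already in the log")
    row = f"| {exp_id} | {hypothesis} | {model} | {target} | {result} | {status} |"
    return log_text.rstrip("\n") + "\n" + row + "\n"
```

Returning a new string instead of editing a file in place keeps the helper easy to test; the caller decides when to write the result back to `EXPERIMENT-LOG.md`.
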
## Model Registry Format (models/registry.yaml)

```yaml
models:
  risk-classifier:
    current: v3
    versions:
      v1:
        metrics: { auc: 0.82, precision: 0.79 }
        experiment: EXP-001
        date: "2026-01-15"
        status: retired
        artifacts: .ctx/ml/experiments/EXP-001/artifacts/
      v2:
        metrics: { auc: 0.87, precision: 0.84 }
        experiment: EXP-002
        date: "2026-02-01"
        status: retired
        artifacts: .ctx/ml/experiments/EXP-002/artifacts/
      v3:
        metrics: { auc: 0.91, precision: 0.88 }
        experiment: EXP-005
        date: "2026-03-10"
        status: production
        artifacts: .ctx/ml/experiments/EXP-005/artifacts/
promotion_criteria:
  primary: "auc >= current + 0.02"
  guard: "precision regression <= 0.01"
  stability: "training loss converges within 50 epochs"
```

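The `primary` and `guard` promotion criteria translate directly into two comparisons. The sketch below assumes metrics are passed as plain dictionaries shaped like the `metrics` entries in the registry; the function name and parameter names are hypothetical. The `stability` criterion depends on training logs and is not covered here.

```python
def meets_promotion_criteria(current: dict, candidate: dict,
                             min_auc_gain: float = 0.02,
                             max_precision_drop: float = 0.01) -> bool:
    """Apply 'auc >= current + 0.02' and 'precision regression <= 0.01' to two metric dicts."""
    auc_ok = candidate["auc"] >= current["auc"] + min_auc_gain
    # A regression is a drop relative to the current model; an improvement gives a negative drop.
    precision_drop = current["precision"] - candidate["precision"]
    return auc_ok and precision_drop <= max_precision_drop
```

With the registry values above, moving from v2 (`auc: 0.87`, `precision: 0.84`) to v3 (`auc: 0.91`, `precision: 0.88`) satisfies both checks. Values that sit exactly on a boundary are subject to floating-point rounding, so a small tolerance may be worth adding in practice.
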
## Feature Registry Format (features/feature-registry.yaml)

```yaml
features:
  age_z_score:
    type: numeric
    source: demographics
    transform: "(age - mean) / std"
    version: 1
    created: "2026-01-15"
    used_by: [risk-classifier, bio-age-model]
    validated: true
    notes: "mean=45.2, std=12.1 from train_v1"

  cholesterol_ratio:
    type: numeric
    source: labs
    transform: "hdl / ldl"
    version: 2
    created: "2026-02-10"
    used_by: [risk-classifier]
    validated: true
    notes: "v2 clips outliers at 99th percentile"
```

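The two `transform` strings in the registry example are ordinary arithmetic, so they can be written out as functions. This is a sketch only: the function names mirror the feature names, the defaults come from the `notes` field of `age_z_score`, and the percentile clipping mentioned for `cholesterol_ratio` v2 is deliberately left out because its exact bounds are not given.

```python
def age_z_score(age: float, mean: float = 45.2, std: float = 12.1) -> float:
    """Standardize age using the training-set statistics recorded in the registry notes."""
    return (age - mean) / std


def cholesterol_ratio(hdl: float, ldl: float) -> float:
    """Plain hdl / ldl ratio, without the outlier clipping that v2 adds."""
    return hdl / ldl
```

An age equal to the recorded mean maps to a z-score of zero, and an age one standard deviation above it (57.3) maps to approximately one.
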
## ML-STATUS.md Format

```markdown
# ML Project Status

**Updated**: {ISO date}
**Active Experiment**: EXP-{n}: {hypothesis title}

## Current Focus

{1-2 sentences on what we are trying to solve right now}

## Recent Results

| EXP | Outcome | Key Learning |
|-----|---------|--------------|
| EXP-{n-1} | accepted | {learning} |
| EXP-{n-2} | rejected | {learning} |

## Blocking Issues

- {Issue if any, else "none"}

## Next Experiments Queued

1. {EXP-{n+1}}: {hypothesis}
2. {EXP-{n+2}}: {hypothesis}
```

## Workflow Rules

1. Every experiment gets its own numbered directory. No experiments in ad-hoc files.
2. `config.yaml` must be committed before running. Results are not reproducible without it.
3. Record results even for rejected experiments. Negative results have value.
4. Update `EXPERIMENT-LOG.md` after every concluded experiment so that it stays current.
5. Update `ML-STATUS.md` whenever the active experiment changes.
6. Never promote a model to the registry without a concluded `RESULTS.md` whose status is `accepted`.
7. The feature registry is append-only with versioning. Old feature versions stay in the file.

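Rule 6 can be checked before any registry update by reading the `**Status**` line of `RESULTS.md`. The sketch below assumes the file follows the RESULTS.md template and is passed in as a string; both function names are hypothetical. An unfilled template line (`accepted | rejected | inconclusive`) intentionally does not match.

```python
import re
from typing import Optional

_STATUS_LINE = re.compile(
    r"^\*\*Status\*\*:\s*(accepted|rejected|inconclusive)\s*$", re.MULTILINE
)


def results_status(results_md: str) -> Optional[str]:
    """Return the recorded status, or None if the Status line is missing or unfilled."""
    match = _STATUS_LINE.search(results_md)
    return match.group(1) if match else None


def can_promote(results_md: str) -> bool:
    """Only a concluded RESULTS.md with status 'accepted' allows promotion."""
    return results_status(results_md) == "accepted"
```

Requiring the status word to be the only content after the label is what distinguishes a concluded file from a copy of the blank template.
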
## Agent Spawn Patterns

When orchestrating experiment work, spawn agents appropriate to the phase:

```
Phase: hypothesize → spawn ctx-ml-scientist
  prompt: "Form a hypothesis for EXP-{n}. Context: {prior results}. Write HYPOTHESIS.md."

Phase: design → spawn ctx-ml-scientist
  prompt: "Design the experiment for EXP-{n}. Write DESIGN.md and config.yaml."

Phase: implement → spawn ctx-ml-engineer
  prompt: "Implement training script for EXP-{n} using config.yaml."

Phase: analyze → spawn ctx-ml-analyst
  prompt: "Analyze results from EXP-{n} run. Write RESULTS.md. Update EXPERIMENT-LOG.md."

Phase: review → spawn ctx-ml-reviewer
  prompt: "Review EXP-{n} results. Recommend: accept, reject, or run follow-up."
```

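The phase-to-agent table above is effectively a lookup, which the sketch below expresses as a dictionary plus a small prompt builder. The agent names and prompt wording come from the table; the function name `spawn_request` and the returned dictionary shape are assumptions, and the actual spawning mechanism is not shown.

```python
# Phase -> (agent name, prompt template), copied from the spawn patterns above.
PHASE_AGENTS = {
    "hypothesize": ("ctx-ml-scientist",
                    "Form a hypothesis for {exp_id}. Context: {context}. Write HYPOTHESIS.md."),
    "design": ("ctx-ml-scientist",
               "Design the experiment for {exp_id}. Write DESIGN.md and config.yaml."),
    "implement": ("ctx-ml-engineer",
                  "Implement training script for {exp_id} using config.yaml."),
    "analyze": ("ctx-ml-analyst",
                "Analyze results from {exp_id} run. Write RESULTS.md. Update EXPERIMENT-LOG.md."),
    "review": ("ctx-ml-reviewer",
               "Review {exp_id} results. Recommend: accept, reject, or run follow-up."),
}


def spawn_request(phase: str, exp_id: str, context: str = "") -> dict:
    """Build the agent name and filled-in prompt for one phase (hypothetical request shape)."""
    agent, template = PHASE_AGENTS[phase]  # KeyError signals an unknown phase
    return {"agent": agent, "prompt": template.format(exp_id=exp_id, context=context)}
```

Keeping the table as data means adding or renaming a phase only touches the dictionary, not the builder.
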
## Iteration Decision Tree

After `analyze`:

```
Result meets primary metric AND guards hold?
  YES → Accept. Update model registry. Queue follow-up if headroom remains.
  NO (primary miss) → Reject. Document learnings. Form new hypothesis.
  NO (guard violated) → Inconclusive. Address regression before retrying.
  Unclear → Run validation on holdout set before deciding.
```
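
The decision tree can be written as a small function. This is a sketch under two assumptions that the tree itself does not spell out: an "unclear" outcome is modeled as `None`, and when the primary metric misses and a guard is also violated, the primary miss takes precedence because it is listed first. The function name `decide` and the return labels are hypothetical.

```python
from typing import Optional


def decide(primary_met: Optional[bool], guards_hold: Optional[bool]) -> str:
    """Map the analyze-phase outcome to a verdict, following the branch order of the tree."""
    if primary_met is None or guards_hold is None:
        # Unclear: gather more evidence before recording a verdict.
        return "validate-on-holdout"
    if primary_met and guards_hold:
        return "accepted"
    if not primary_met:
        return "rejected"
    # Primary metric met, but at least one guard metric regressed.
    return "inconclusive"
```

For instance, `decide(True, False)` returns `"inconclusive"`, which signals that the regression should be addressed before retrying.
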