@zigrivers/scaffold 3.10.1 → 3.12.0
- package/README.md +7 -5
- package/content/knowledge/core/automated-review-tooling.md +137 -140
- package/content/knowledge/core/multi-model-research-dispatch.md +219 -0
- package/content/knowledge/core/multi-model-review-dispatch.md +47 -6
- package/content/knowledge/game/game-ideation.md +100 -0
- package/content/knowledge/product/ideation-craft.md +209 -0
- package/content/methodology/game-overlay.yml +2 -0
- package/content/pipeline/foundation/tech-stack.md +1 -0
- package/content/pipeline/vision/create-vision.md +43 -0
- package/content/skills/multi-model-dispatch/SKILL.md +20 -22
- package/content/tools/post-implementation-review.md +71 -26
- package/content/tools/prompt-pipeline.md +1 -0
- package/content/tools/review-code.md +37 -11
- package/content/tools/review-pr.md +65 -23
- package/content/tools/spark.md +337 -0
- package/package.json +1 -1
- package/skills/multi-model-dispatch/SKILL.md +20 -22
package/README.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# Scaffold
|
|
2
2
|
|
|
3
|
-
A TypeScript CLI that assembles AI-powered prompts at runtime to guide you from "I have an idea" to working software. Scaffold walks you through 60 structured pipeline steps — organized into 16 phases — plus
|
|
3
|
+
A TypeScript CLI that assembles AI-powered prompts at runtime to guide you from "I have an idea" to working software. Scaffold walks you through 60 structured pipeline steps — organized into 16 phases — plus 11 utility tools, and the supported AI tools handle the research, planning, and implementation for you.
|
|
4
4
|
|
|
5
5
|
By the end, you'll have a fully planned, standards-documented, implementation-ready project with working code.
|
|
6
6
|
|
|
@@ -38,7 +38,7 @@ Either way, Scaffold constructs the prompt and the target AI tool does the work.
|
|
|
38
38
|
|
|
39
39
|
**Depth scale** (1-5) — Controls how thorough each step's output is, from "focus on the core deliverable" (1) to "explore all angles, tradeoffs, and edge cases" (5). Depth resolves with 4-level precedence: CLI flag > step override > custom default > preset default.
|
|
40
40
|
|
|
41
|
-
**Multi-model validation** — At depth 4-5, all 19 review and validation steps can dispatch independent reviews to Codex and/or Gemini CLIs. Two independent models catch more blind spots than one. When both CLIs are available, findings are reconciled by confidence level (both agree = high confidence, single model P0 = still actionable).
|
|
41
|
+
**Multi-model validation** — At depth 4-5, all 19 review and validation steps can dispatch independent reviews to Codex and/or Gemini CLIs. Two independent models catch more blind spots than one. When both CLIs are available, findings are reconciled by confidence level (both agree = high confidence, single model P0 = still actionable). When a channel is unavailable, a compensating Claude self-review pass runs in its place (labeled `[compensating: Codex-equivalent]` or `[compensating: Gemini-equivalent]`, single-source confidence). CLI commands must always run in the foreground — background execution produces empty output. See the [Multi-Model Review](#multi-model-review) section.
|
|
42
42
|
|
|
43
43
|
**State management** — Pipeline progress is tracked in `.scaffold/state.json` with atomic file writes and crash recovery. An advisory lock prevents concurrent runs. Decisions are logged to an append-only `decisions.jsonl`.
|
|
44
44
|
|
|
@@ -931,12 +931,13 @@ mmr review --pr 47 ──→ Dispatches to all channels in background
|
|
|
931
931
|
Agent continues working
|
|
932
932
|
|
|
933
933
|
mmr status mmr-a1b2c3 ──→ Poll progress (which channels done?)
|
|
934
|
-
Exit code: 0=done, 1=running,
|
|
934
|
+
Exit code: 0=done, 1=running, 4=failed
|
|
935
935
|
|
|
936
936
|
mmr results mmr-a1b2c3 ──→ Reconcile findings across channels
|
|
937
|
+
Run compensating passes for unavailable channels
|
|
937
938
|
Apply severity gate
|
|
938
939
|
Output unified findings
|
|
939
|
-
Exit code: 0=passed,
|
|
940
|
+
Exit code: 0=passed, 2=gate failed, 3=degraded
|
|
940
941
|
```
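The exit codes in the flow above lend themselves to a small lookup. The sketch below is illustrative only: `describeStatusExit` and `describeResultsExit` are hypothetical helper names, not part of the `mmr` CLI, and only the codes listed in the diagram are mapped. Any other code is treated as unknown.

```typescript
// Hypothetical helpers: translate the exit codes shown in the diagram
// into readable labels. Codes the README does not list map to "unknown".
type StatusOutcome = "done" | "running" | "failed" | "unknown";
type ResultsOutcome = "passed" | "gate-failed" | "degraded" | "unknown";

// Exit codes documented for `mmr status`.
function describeStatusExit(code: number): StatusOutcome {
  switch (code) {
    case 0: return "done";
    case 1: return "running";
    case 4: return "failed";
    default: return "unknown";
  }
}

// Exit codes documented for `mmr results`.
function describeResultsExit(code: number): ResultsOutcome {
  switch (code) {
    case 0: return "passed";
    case 2: return "gate-failed";
    case 3: return "degraded";
    default: return "unknown";
  }
}
```

A caller polling `mmr status` would keep waiting while the label is `running` and only then read the `mmr results` code.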
|
|
941
942
|
|
|
942
943
|
**Key features:**
|
|
@@ -1381,9 +1382,10 @@ These are orthogonal to the pipeline — usable at any time, not tied to pipelin
|
|
|
1381
1382
|
| `scaffold run review-code` | Run all 3 code review channels on local code before commit or push. |
|
|
1382
1383
|
| `scaffold run review-pr` | Run all 3 code review channels (Codex CLI, Gemini CLI, Superpowers) on a PR. |
|
|
1383
1384
|
| `scaffold run post-implementation-review` | Full 3-channel codebase review after an AI agent completes all tasks — checks requirements coverage, security, architecture alignment, and more. |
|
|
1385
|
+
| `scaffold run spark` | Explore and expand a raw project idea through Socratic questioning, competitive research, and innovation expansion. Produces a `docs/spark-brief.md` that feeds into `create-vision`. At depth 4+, dispatches to external models for independent research and adversarial red-teaming. |
|
|
1384
1386
|
| `scaffold run session-analyzer` | Analyze Claude Code session logs for patterns and insights. |
|
|
1385
1387
|
|
|
1386
|
-
Use `scaffold run review-code` before commit or push when you want a local gate on the current delivery candidate. Use `scaffold run review-pr` after a GitHub PR exists.
|
|
1388
|
+
Use `scaffold run spark` before `create-vision` when you have a vague idea that needs sharpening. Use `scaffold run review-code` before commit or push when you want a local gate on the current delivery candidate. Use `scaffold run review-pr` after a GitHub PR exists.
|
|
1387
1389
|
|
|
1388
1390
|
Run any of these via the CLI or ask the scaffold runner skill in Claude Code or Gemini.
|
|
1389
1391
|
|
package/content/knowledge/core/automated-review-tooling.md
CHANGED
|
@@ -10,194 +10,191 @@ Automated PR review leverages AI models to provide consistent, thorough code rev
|
|
|
10
10
|
|
|
11
11
|
## Summary
|
|
12
12
|
|
|
13
|
-
###
|
|
13
|
+
### Review Severity and Reconciliation
|
|
14
14
|
|
|
15
|
-
|
|
16
|
-
- **No CI secrets required** — models run locally via CLI tools
|
|
17
|
-
- **Dual-model review** — run Codex and Gemini (when available) for independent perspectives
|
|
18
|
-
- **Agent-managed loop** — Claude orchestrates the review-fix cycle locally
|
|
15
|
+
See `review-methodology` for severity definitions (P0-P3). See `multi-model-review-dispatch` for finding reconciliation rules.
|
|
19
16
|
|
|
20
|
-
|
|
21
|
-
- `AGENTS.md` — reviewer instructions with project-specific rules
|
|
22
|
-
- `docs/review-standards.md` — severity definitions (P0-P3) and criteria
|
|
23
|
-
- `scripts/cli-pr-review.sh` — dual-model review script
|
|
24
|
-
- `scripts/await-pr-review.sh` — polling script for external bot mode
|
|
17
|
+
**Action thresholds:** P0/P1/P2 findings must be fixed before proceeding to the next task. P3 findings are recorded but not actioned.
|
|
25
18
|
|
|
26
|
-
###
|
|
19
|
+
### Degraded-Mode Behavior
|
|
27
20
|
|
|
28
|
-
|
|
29
|
-
- **P0 (blocking)** — must fix before merge (security, data loss, broken functionality)
|
|
30
|
-
- **P1 (important)** — should fix before merge (bugs, missing tests, performance)
|
|
31
|
-
- **P2 (suggestion)** — consider fixing (style, naming, documentation)
|
|
32
|
-
- **P3 (nit)** — optional (personal preference, minor optimization)
|
|
21
|
+
#### Verdict Definitions
|
|
33
22
|
|
|
34
|
-
|
|
23
|
+
These are the authoritative verdict definitions. Tool files (`review-code.md`, `review-pr.md`) reference these.
|
|
35
24
|
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
25
|
+
| Verdict | Condition |
|
|
26
|
+
|---------|-----------|
|
|
27
|
+
| `pass` | All configured channels ran, no unresolved P0/P1/P2 |
|
|
28
|
+
| `degraded-pass` | Channels skipped, compensated, or have non-full coverage (e.g., partial timeout), no unresolved P0/P1/P2 |
|
|
29
|
+
| `blocked` | Unresolved P0/P1/P2 after 3 fix rounds |
|
|
30
|
+
| `needs-user-decision` | Contradictions or unresolvable findings |
|
|
41
31
|
|
|
42
|
-
|
|
32
|
+
**Verdict precedence:** `needs-user-decision` > `blocked` > `degraded-pass` > `pass`. When multiple conditions apply, the higher-precedence verdict wins.
|
|
43
33
|
|
|
44
|
-
|
|
45
|
-
1. Agent creates PR
|
|
46
|
-
2. Agent runs `scripts/cli-pr-review.sh` (or review runs automatically)
|
|
47
|
-
3. Review findings are posted as PR comments or written to a local file
|
|
48
|
-
4. Agent addresses P0/P1/P2 findings, pushes fixes
|
|
49
|
-
5. Re-review until no P0/P1/P2 findings remain
|
|
50
|
-
6. PR is ready for merge
|
|
34
|
+
**Both external channels missing:** Maximum achievable verdict is `degraded-pass` — never `pass`. Review summary must note: "All findings are single-model (Claude only). External validation was unavailable."
|
|
51
35
|
|
|
52
|
-
|
|
36
|
+
#### Status Model
|
|
53
37
|
|
|
54
|
-
|
|
38
|
+
`compensating` is a **coverage label** applied to a channel's output, not a replacement for the root-cause status. Each channel retains its root-cause status (`not_installed`, `auth_failed`, `auth_timeout`, `failed`) AND gains a coverage label (`compensating (X-equivalent)`) when a compensating pass ran. The fix cycle uses the **root-cause status** to decide whether to retry (never retry `not_installed`, `auth_failed`, `auth_timeout`). The report uses the **coverage label** to show the reader what ran.
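One way to picture the status model is a record that keeps the root-cause status and the coverage label side by side. This is a minimal sketch under assumptions: the `ChannelResult` shape, the `completed` status for a channel that ran normally, and the `reportLine` helper are all hypothetical names introduced for illustration.

```typescript
// Root-cause statuses named in the text, plus "completed" as an assumed
// status for a channel that ran normally.
type RootCause = "completed" | "not_installed" | "auth_failed" | "auth_timeout" | "failed";

interface ChannelResult {
  channel: "Codex" | "Gemini";
  rootCause: RootCause;  // why the real channel did or did not run
  compensated: boolean;  // true when a Claude self-review pass ran in its place
}

// The report keeps the root cause and appends the coverage label when a
// compensating pass ran; the label never replaces the root cause.
function reportLine(r: ChannelResult): string {
  const label = r.compensated
    ? ` (compensating: ${r.channel}-equivalent pass ran)`
    : "";
  return `${r.channel} channel: ${r.rootCause}${label}`;
}
```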
|
|
55
39
|
|
|
56
|
-
|
|
40
|
+
#### Compensating Passes
|
|
57
41
|
|
|
58
|
-
|
|
59
|
-
# Code Review Instructions
|
|
42
|
+
When an external channel (Codex or Gemini) is unavailable, run a compensating Claude self-review pass:
|
|
60
43
|
|
|
61
|
-
|
|
62
|
-
[
|
|
44
|
+
- Same prompt structure as the missing channel, executed as a Claude self-review pass.
|
|
45
|
+
- Labeled `[compensating: Codex-equivalent]` or `[compensating: Gemini-equivalent]` in the review summary.
|
|
46
|
+
- Missing Codex → focus on implementation correctness, security, API contracts.
|
|
47
|
+
- Missing Gemini → focus on architectural patterns, design reasoning, broad context.
|
|
48
|
+
- Missing both → two compensating passes (one per missing channel's strength area).
|
|
49
|
+
- Compensating-pass findings are **single-source confidence** — they do NOT raise to high confidence even if they agree with another channel's findings.
|
|
50
|
+
- Normal mandatory-fix thresholds apply: P0/P1/P2 findings from compensating passes still require fixing.
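The rules above can be sketched as a small planning function. The focus areas and labels come from the list; the `CompensatingPass` shape and the `planCompensatingPasses` name are assumptions made for illustration, not part of Scaffold.

```typescript
type ExternalChannel = "Codex" | "Gemini";

interface CompensatingPass {
  label: string;                // label shown in the review summary
  focus: string[];              // strength area of the missing channel
  confidence: "single-source";  // compensating findings never raise confidence
}

// Focus areas per missing channel, as listed in the text.
const FOCUS: Record<ExternalChannel, string[]> = {
  Codex: ["implementation correctness", "security", "API contracts"],
  Gemini: ["architectural patterns", "design reasoning", "broad context"],
};

// One compensating pass per missing external channel.
function planCompensatingPasses(missing: ExternalChannel[]): CompensatingPass[] {
  return missing.map((channel): CompensatingPass => ({
    label: `[compensating: ${channel}-equivalent]`,
    focus: FOCUS[channel],
    confidence: "single-source",
  }));
}
```

Passing both channels yields two passes, which matches the "missing both" rule.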
|
|
63
51
|
|
|
64
|
-
|
|
65
|
-
- Security: [project-specific security concerns]
|
|
66
|
-
- Performance: [known hot paths or constraints]
|
|
67
|
-
- Testing: [coverage requirements, test patterns]
|
|
52
|
+
**Superpowers channel:** No compensating pass is needed. Superpowers is a Claude subagent, so it is available whenever the Superpowers plugin is installed. If the plugin is not installed, run the available external CLIs and warn the user that review coverage is reduced.
|
|
68
53
|
|
|
69
|
-
|
|
70
|
-
See docs/coding-standards.md for:
|
|
71
|
-
- Naming conventions
|
|
72
|
-
- Error handling patterns
|
|
73
|
-
- Logging standards
|
|
54
|
+
#### Foreground-Only Execution
|
|
74
55
|
|
|
75
|
-
|
|
76
|
-
[Project-specific patterns reviewers should enforce]
|
|
56
|
+
Always run Codex and Gemini CLI commands as foreground Bash calls. Never use `run_in_background`, `&`, or `nohup`. Background execution produces empty or truncated output from Codex and Gemini CLIs. Multiple foreground calls can still run in parallel if the tool runner supports parallel tool invocations.
|
|
77
57
|
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
|
|
58
|
+
This constraint is intentionally duplicated from `multi-model-review-dispatch`. Knowledge entries are injected independently by the assembly engine — an agent may receive this entry without `multi-model-review-dispatch`, so both need the constraint.
|
|
59
|
+
|
|
60
|
+
## Deep Guidance
|
|
61
|
+
|
|
62
|
+
### Finding Reconciliation
|
|
63
|
+
|
|
64
|
+
After all channels complete (including compensating passes), reconcile findings using the rules in `multi-model-review-dispatch`. This orchestration entry triggers reconciliation; the dispatch entry defines how to perform it.
|
|
81
65
|
|
|
82
|
-
|
|
66
|
+
Reconciliation normalizes findings from all channels (real and compensating) to a common schema, then matches findings across channels by location and category. The purpose is to detect when multiple independent channels agree on a finding (raising confidence) and to surface contradictions that require human judgment. A finding reported by Codex alone has lower confidence than the same finding reported by both Codex and Gemini.
|
|
83
67
|
|
|
84
|
-
The
|
|
68
|
+
The reconciliation output is a deduplicated list of findings with confidence scores. High-confidence findings (agreed by 2+ real channels) are actionable without further discussion. Low-confidence findings (single-source, or from compensating passes) still require action at P0/P1/P2 but should be noted as lower-confidence in the review summary.
|
|
69
|
+
|
|
70
|
+
Findings that appear in all three channels (Codex, Gemini, Superpowers) are considered maximum-confidence and should be surfaced first in the review summary. Findings that appear in only one channel should include the channel name in the finding description to help the developer assess confidence independently.
|
|
85
71
|
|
|
86
72
|
```bash
|
|
87
|
-
|
|
88
|
-
|
|
89
|
-
|
|
90
|
-
#
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
# 2. Run Codex review (if available)
|
|
94
|
-
if command -v codex &>/dev/null; then
|
|
95
|
-
codex_findings=$(echo "$diff" | codex review --context AGENTS.md)
|
|
96
|
-
fi
|
|
97
|
-
|
|
98
|
-
# 3. Run Gemini review (if available)
|
|
99
|
-
if command -v gemini &>/dev/null; then
|
|
100
|
-
gemini_findings=$(echo "$diff" | gemini review --context AGENTS.md)
|
|
101
|
-
fi
|
|
102
|
-
|
|
103
|
-
# 4. Reconcile findings
|
|
104
|
-
# - Findings from both models: HIGH confidence
|
|
105
|
-
# - Findings from one model: MEDIUM confidence
|
|
106
|
-
# - Contradictions: flagged for human review
|
|
73
|
+
# Orchestration reconciliation workflow
|
|
74
|
+
# 1. Collect findings from all channels (real + compensating)
|
|
75
|
+
# 2. Normalize to common schema (severity, category, location, description)
|
|
76
|
+
# 3. Match findings across channels by location + category
|
|
77
|
+
# 4. Apply consensus rules from multi-model-review-dispatch
|
|
78
|
+
# 5. Produce reconciled findings list with confidence scores
|
|
107
79
|
```
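The workflow steps above can be sketched in code. This is a simplified illustration, not the rules from `multi-model-review-dispatch` (which are not reproduced here). It assumes findings are already normalized, that matching means identical location and category, and that the most severe rating in a group wins. The `Finding` and `Reconciled` shapes are hypothetical, and contradiction detection is left out.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

interface Finding {
  source: string;         // e.g. "Codex", "Gemini", "Superpowers"
  compensating: boolean;  // true for findings from a compensating pass
  severity: Severity;
  category: string;
  location: string;       // e.g. "src/auth.ts:42"
}

interface Reconciled {
  key: string;
  severity: Severity;
  sources: string[];
  confidence: "high" | "single-source";
}

function unique(xs: string[]): string[] {
  return xs.filter((x, i) => xs.indexOf(x) === i);
}

function reconcile(findings: Finding[]): Reconciled[] {
  // Group findings that share a location and category.
  const groups: Record<string, Finding[]> = {};
  for (const f of findings) {
    const key = `${f.location}|${f.category}`;
    (groups[key] = groups[key] || []).push(f);
  }
  return Object.keys(groups).map((key): Reconciled => {
    const group = groups[key] || [];
    // Only real channels count toward agreement.
    const real = unique(group.filter((f) => !f.compensating).map((f) => f.source));
    // "P0" sorts before "P3", so the first element is the most severe.
    const severity = group.map((f) => f.severity).sort()[0] || "P3";
    return {
      key,
      severity,
      sources: unique(group.map((f) => f.source)),
      confidence: real.length >= 2 ? "high" : "single-source",
    };
  });
}
```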
|
|
108
80
|
|
|
109
|
-
###
|
|
81
|
+
### Channel Dispatch Pattern and Orchestration
|
|
110
82
|
|
|
111
|
-
|
|
112
|
-
- Severity levels with concrete examples per project
|
|
113
|
-
- What constitutes a blocking review (P0/P1/P2 threshold)
|
|
114
|
-
- Auto-approve criteria (when review can be skipped)
|
|
115
|
-
- Review SLA (how long before auto-approve kicks in)
|
|
83
|
+
Each external channel (Codex, Gemini) follows the same dispatch pattern: check installation, check auth, then dispatch as a foreground call. If any step fails, record the root-cause status, queue a compensating pass, and continue to the next channel. The Superpowers channel runs as a Claude subagent, so it needs no CLI installation or auth check and is available whenever the Superpowers plugin is installed.
|
|
116
84
|
|
|
117
|
-
|
|
85
|
+
```bash
|
|
86
|
+
# Channel dispatch pattern
|
|
87
|
+
# For each external channel (Codex, Gemini):
|
|
88
|
+
# 1. command -v <tool> >/dev/null 2>&1 || { status=not_installed; queue_compensating; continue; }
|
|
89
|
+
# 2. <auth_check> || { status=auth_failed; queue_compensating; continue; }
|
|
90
|
+
# 3. <dispatch_foreground> || { status=failed; queue_compensating; continue; }
|
|
91
|
+
# For Superpowers: dispatch subagent (always available)
|
|
92
|
+
# After all: run queued compensating passes → reconcile → verdict
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
After all channels and compensating passes complete, run the reconciliation workflow above and apply the verdict decision flow. Channel results and compensating-pass labels must be preserved in the review output for auditability — do not collapse or omit them even when findings are empty.
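The dispatch pattern can also be expressed as a loop over channel probes. The sketch below is an assumption-laden illustration: the three callbacks stand in for the real installation check, auth check, and foreground dispatch, and the `ChannelProbe` and `dispatchExternalChannels` names are hypothetical. The `auth_timeout` status is omitted for brevity.

```typescript
type RootCause = "completed" | "not_installed" | "auth_failed" | "failed";

interface ChannelProbe {
  name: string;
  isInstalled: () => boolean;      // stands in for `command -v <tool>`
  isAuthenticated: () => boolean;  // stands in for the tool's auth check
  dispatch: () => boolean;         // foreground call; true on success
}

interface DispatchOutcome {
  statuses: Record<string, RootCause>;
  compensatingQueue: string[];     // channels that need a compensating pass
}

function dispatchExternalChannels(channels: ChannelProbe[]): DispatchOutcome {
  const statuses: Record<string, RootCause> = {};
  const compensatingQueue: string[] = [];
  for (const ch of channels) {
    // The first failing step decides the root-cause status.
    let status: RootCause = "completed";
    if (!ch.isInstalled()) status = "not_installed";
    else if (!ch.isAuthenticated()) status = "auth_failed";
    else if (!ch.dispatch()) status = "failed";
    statuses[ch.name] = status;
    // Any failure queues a compensating pass and moves to the next channel.
    if (status !== "completed") compensatingQueue.push(ch.name);
  }
  return { statuses, compensatingQueue };
}
```

Keeping the statuses and the queue separate preserves the root cause for the fix cycle while the queue drives the compensating passes.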
|
|
96
|
+
|
|
97
|
+
### Degraded-Mode Worked Example
|
|
118
98
|
|
|
119
|
-
|
|
120
|
-
1. Claude performs an enhanced self-review of the diff
|
|
121
|
-
2. Focus on the AGENTS.md review criteria
|
|
122
|
-
3. Apply the same severity classification
|
|
123
|
-
4. Document that the review was single-model
|
|
99
|
+
When Codex is unavailable (not installed or auth failure), the orchestration proceeds as follows. This example uses the not-installed case:
|
|
124
100
|
|
|
125
|
-
|
|
101
|
+
1. The installation check (`command -v codex`) fails. Codex channel status is set to `not_installed`.
|
|
102
|
+
2. A compensating Codex-equivalent pass is queued: a Claude self-review focused on implementation correctness, security, and API contracts.
|
|
103
|
+
3. Gemini and Superpowers channels run normally.
|
|
104
|
+
4. The compensating pass runs, producing findings labeled `[compensating: Codex-equivalent]`.
|
|
105
|
+
5. Reconciliation merges findings from all three sources (Gemini, Superpowers, compensating-Codex).
|
|
106
|
+
6. Maximum achievable verdict is `degraded-pass` because a real channel was absent.
|
|
107
|
+
7. The review summary notes: "Codex channel: not_installed (compensating: Codex-equivalent pass ran)."
|
|
126
108
|
|
|
127
|
-
|
|
128
|
-
- Add new review focus areas when new patterns emerge
|
|
129
|
-
- Remove rules that linters now enforce automatically
|
|
130
|
-
- Update AGENTS.md when architecture changes
|
|
131
|
-
- Track false-positive rates and adjust thresholds
|
|
109
|
+
**Fix-cycle channel rule:** Only re-run channels that originally completed or ran as compensating passes. `failed` channels are covered by their compensating pass and are not retried during fix rounds. Never retry a channel with status `not_installed`, `auth_failed`, or `auth_timeout` — these indicate persistent environment conditions that will not resolve between fix rounds.
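Read as code, the fix-cycle rule reduces to one decision per channel. This sketch assumes that every channel which did not complete already has a compensating pass. The `completed` status and the `fixRoundAction` name are hypothetical.

```typescript
type RootCause = "completed" | "not_installed" | "auth_failed" | "auth_timeout" | "failed";
type FixRoundAction = "rerun-real-channel" | "rerun-compensating-pass";

// A channel that completed is re-run as itself. Every other root cause is
// covered by its compensating pass, so the real channel is not retried.
function fixRoundAction(rootCause: RootCause): FixRoundAction {
  return rootCause === "completed"
    ? "rerun-real-channel"
    : "rerun-compensating-pass";
}
```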
|
|
132
110
|
|
|
133
|
-
###
|
|
111
|
+
### Verdict Decision Flow
|
|
134
112
|
|
|
135
|
-
|
|
113
|
+
Apply the following evaluation order to determine the final verdict. The first matching condition wins; all subsequent conditions are skipped.
|
|
136
114
|
|
|
137
115
|
```
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
|
|
141
|
-
|
|
142
|
-
|
|
143
|
-
│ Unique finding │ Found │ - │ MEDIUM confidence │
|
|
144
|
-
│ Unique finding │ - │ Found │ MEDIUM confidence │
|
|
145
|
-
│ Contradiction │ Fix X │ Keep X │ Flag for agent │
|
|
146
|
-
└─────────────────┴──────────┴──────────┴───────────────────┘
|
|
116
|
+
Verdict evaluation order:
|
|
117
|
+
1. Any contradictions or unresolvable findings? → needs-user-decision
|
|
118
|
+
2. Any unresolved P0/P1/P2 after 3 fix rounds? → blocked
|
|
119
|
+
3. Any channel not at full coverage? → degraded-pass
|
|
120
|
+
4. All channels completed, no unresolved P0/P1/P2? → pass
|
|
147
121
|
```
|
|
148
122
|
|
|
149
|
-
|
|
123
|
+
A "contradiction" exists when two channels report opposite conclusions about the same code location — for example, Codex flags a function as insecure while Gemini explicitly approves it. Contradictions cannot be resolved by the agent alone and must be surfaced to the user.
|
|
124
|
+
|
|
125
|
+
A channel is "not at full coverage" when: it ran as a compensating pass instead of a real tool, it timed out partially, or the Superpowers plugin is not installed and available channels do not cover the full diff.
|
|
126
|
+
|
|
127
|
+
**Verdict precedence reminder:** `needs-user-decision` > `blocked` > `degraded-pass` > `pass`. If multiple conditions apply simultaneously (for example, both a contradiction and an unresolved P0 exist), the higher-precedence verdict wins.
|
|
128
|
+
|
|
129
|
+
The verdict is computed only once the fix cycle has ended, either because all P0/P1/P2 findings are resolved or because the 3-round limit is reached; do not emit a partial verdict mid-cycle. If a fix round resolves all P0/P1/P2 findings and no contradictions remain, the final verdict is `pass` or `degraded-pass`, depending on channel coverage, rather than `blocked`. Verify this explicitly by re-running the reconciliation step after each fix round; do not assume it from the fact that fixes were applied.
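The evaluation order in this section maps onto a chain of early returns. The following is a minimal sketch: the `ReviewState` fields are hypothetical inputs that a caller would fill in after the final reconciliation, and `decideVerdict` is an illustrative name.

```typescript
type Verdict = "needs-user-decision" | "blocked" | "degraded-pass" | "pass";

interface ReviewState {
  hasContradictions: boolean;        // contradictions or unresolvable findings
  unresolvedBlocking: number;        // unresolved P0/P1/P2 once fix rounds end
  allChannelsFullCoverage: boolean;  // false if any channel was skipped, compensated, or partial
}

// First matching condition wins, in the documented precedence order.
function decideVerdict(s: ReviewState): Verdict {
  if (s.hasContradictions) return "needs-user-decision";
  if (s.unresolvedBlocking > 0) return "blocked";
  if (!s.allChannelsFullCoverage) return "degraded-pass";
  return "pass";
}
```

Because the checks run in precedence order, a review with both a contradiction and an unresolved P0 returns `needs-user-decision`.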
|
|
150
130
|
|
|
151
131
|
### Security-Focused Review Checklist
|
|
152
132
|
|
|
153
133
|
Every automated review should check:
|
|
154
|
-
- No secrets or credentials in the diff (API keys, passwords, tokens)
|
|
155
|
-
- No `eval()` or equivalent unsafe operations introduced
|
|
156
|
-
- SQL queries use parameterized queries
|
|
157
|
-
- User input is validated before use
|
|
158
|
-
- Authentication/authorization checks are present on new endpoints
|
|
159
|
-
- Dependencies added are from trusted sources with known versions
|
|
134
|
+
- No secrets or credentials in the diff (API keys, passwords, tokens, private keys)
|
|
135
|
+
- No `eval()` or equivalent unsafe operations introduced (dynamic code execution, shell injection)
|
|
136
|
+
- SQL queries use parameterized queries — no string concatenation with user input
|
|
137
|
+
- User input is validated and sanitized before use in queries, commands, or output
|
|
138
|
+
- Authentication/authorization checks are present on all new endpoints and operations
|
|
139
|
+
- Dependencies added are from trusted sources with known, pinned versions
|
|
140
|
+
- No new global state or singletons that could cause cross-request data leaks
|
|
141
|
+
- Error messages do not expose internal paths, stack traces, or sensitive system details
|
|
142
|
+
- File system operations use safe path handling (no path traversal vulnerabilities)
|
|
143
|
+
- Cryptographic operations use approved algorithms and key lengths
|
|
144
|
+
|
|
145
|
+
When reviewing diffs that touch authentication, authorization, or data handling, elevate any security-related finding by one severity level. A finding that would normally be P2 (recommended) becomes P1 (required) in security-sensitive code paths. This conservative stance reflects the asymmetric cost of security failures versus the cost of over-caution during review.
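The one-level elevation can be sketched as a lookup in an ordered severity list. The function name and its two boolean inputs are assumptions; how a reviewer decides that a finding is security-related or that a path is security-sensitive is outside this sketch.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

// Ordered from most to least severe.
const SEVERITY_ORDER: Severity[] = ["P0", "P1", "P2", "P3"];

// Raise a security finding by one level when the diff touches
// authentication, authorization, or data handling. P0 stays at P0.
function elevateForSensitivePath(
  severity: Severity,
  isSecurityFinding: boolean,
  touchesSensitiveCode: boolean,
): Severity {
  if (!isSecurityFinding || !touchesSensitiveCode) return severity;
  const index = SEVERITY_ORDER.indexOf(severity);
  return SEVERITY_ORDER[Math.max(0, index - 1)] || severity;
}
```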
|
|
160
146
|
|
|
161
147
|
### Performance Review Patterns
|
|
162
148
|
|
|
163
|
-
Look for these performance anti-patterns:
|
|
164
|
-
- N+1 queries (loop
|
|
165
|
-
- Missing pagination on list endpoints
|
|
166
|
-
- Synchronous operations that should be async
|
|
167
|
-
- Large objects passed by value instead of reference
|
|
168
|
-
- Missing caching for expensive computations
|
|
169
|
-
- Unbounded growth in arrays or maps
|
|
149
|
+
Look for these performance anti-patterns in the diff:
|
|
150
|
+
- N+1 queries (loop containing individual DB calls — use batch queries or eager loading)
|
|
151
|
+
- Missing pagination on list endpoints (unbounded result sets)
|
|
152
|
+
- Synchronous operations that should be async (blocking I/O in hot paths)
|
|
153
|
+
- Large objects passed by value instead of reference (unnecessary deep copies)
|
|
154
|
+
- Missing caching for expensive computations that are called repeatedly
|
|
155
|
+
- Unbounded growth in arrays or maps (no eviction, no size limits)
|
|
156
|
+
- Missing indexes on columns used in WHERE clauses of new queries
|
|
157
|
+
- Eager loading where lazy loading would suffice (over-fetching)
|
|
158
|
+
- Missing connection pooling or connection reuse for external services
|
|
170
159
|
|
|
171
|
-
###
|
|
160
|
+
### Common False Positives
|
|
172
161
|
|
|
173
|
-
|
|
162
|
+
Track and suppress recurring false positives to reduce noise in future reviews:
|
|
163
|
+
- Test files flagged for "hardcoded values" (test fixtures and expected values are intentional)
|
|
164
|
+
- Migration files flagged for "raw SQL" (migrations must use raw SQL for schema changes)
|
|
165
|
+
- Generated files flagged for style issues (generated code follows its own generator's conventions)
|
|
166
|
+
- Intentional use of `any` types in TypeScript adapter layers or third-party type overrides
|
|
167
|
+
- Deliberate `eslint-disable` comments that are already justified in surrounding context
|
|
168
|
+
- Seed data files flagged for hardcoded credentials (test-only, not production)
|
|
174
169
|
|
|
175
|
-
|
|
176
|
-
## Code Review
|
|
177
|
-
| Command | Purpose |
|
|
178
|
-
|---------|---------|
|
|
179
|
-
| `scripts/cli-pr-review.sh <PR#>` | Run dual-model review |
|
|
180
|
-
| `scripts/await-pr-review.sh <PR#>` | Poll for external review |
|
|
181
|
-
```
|
|
170
|
+
Add suppressions to AGENTS.md under "Out of Scope" to prevent repeated false findings across review cycles.
|
|
182
171
|
|
|
183
|
-
|
|
172
|
+
### Review Metrics and Continuous Improvement
|
|
184
173
|
|
|
185
|
-
|
|
174
|
+
Track these metrics over time to improve review quality and calibrate thresholds:
|
|
186
175
|
|
|
187
|
-
|
|
188
|
-
|
|
189
|
-
|
|
190
|
-
|
|
176
|
+
| Metric | Definition | Use |
|
|
177
|
+
|--------|------------|-----|
|
|
178
|
+
| False positive rate | Findings dismissed without action / total findings | Calibrate severity thresholds |
|
|
179
|
+
| Escape rate | Bugs reaching production despite review / total bugs | Identify coverage gaps |
|
|
180
|
+
| Time to resolve | Average time between finding logged and fix merged | Identify bottlenecks |
|
|
181
|
+
| Coverage | PRs receiving automated review / total PRs merged | Track adoption |
|
|
182
|
+
| Model agreement rate | Findings agreed by 2+ channels / total findings | Tune reconciliation rules |
|
|
183
|
+
| Compensating-pass rate | Reviews using compensating passes / total reviews | Track environment health |
|
|
191
184
|
|
|
192
|
-
|
|
185
|
+
Use the false positive rate to determine whether a severity category is over-triggering. Use the escape rate to determine whether the review is missing entire classes of bugs. Use the compensating-pass rate to identify when the review environment needs maintenance (expired auth tokens, broken CLI installs).
|
|
193
186
|
|
|
194
|
-
|
|
187
|
+
Log metric snapshots in AGENTS.md after each major project milestone. A declining model agreement rate over time suggests either that the review prompts are drifting in quality or that the codebase is accumulating technical debt in areas where models diverge. A rising escape rate despite consistent review coverage is a signal to revisit the severity thresholds or the focus areas in the review prompts.
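Three of the ratios in the table can be computed directly from review counts. The sketch below is illustrative: the `ReviewCounts` fields are hypothetical names for the numerators and denominators the table defines, and a zero denominator is reported as 0 by assumption.

```typescript
interface ReviewCounts {
  totalFindings: number;
  dismissedFindings: number;           // findings dismissed without action
  findingsAgreedByTwoPlus: number;     // findings agreed by 2+ channels
  totalReviews: number;
  reviewsWithCompensatingPass: number;
}

// Avoid division by zero on an empty history.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function reviewMetrics(c: ReviewCounts) {
  return {
    falsePositiveRate: ratio(c.dismissedFindings, c.totalFindings),
    modelAgreementRate: ratio(c.findingsAgreedByTwoPlus, c.totalFindings),
    compensatingPassRate: ratio(c.reviewsWithCompensatingPass, c.totalReviews),
  };
}
```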
|
|
188
|
+
|
|
189
|
+
### Fallback When Models Unavailable
|
|
190
|
+
|
|
191
|
+
When external CLIs are unavailable, the degraded-mode behavior defined in the Summary section applies. To summarize the operational steps:
|
|
195
192
|
|
|
196
|
-
|
|
197
|
-
|
|
198
|
-
-
|
|
199
|
-
-
|
|
200
|
-
-
|
|
201
|
-
|
|
193
|
+
1. For each unavailable external channel, queue a compensating Claude self-review pass focused on that channel's strength area.
|
|
194
|
+
2. Label findings as `[compensating: Codex-equivalent]` or `[compensating: Gemini-equivalent]`.
|
|
195
|
+
3. Treat compensating findings as single-source confidence — they do not raise to high confidence even when they agree with another channel.
|
|
196
|
+
4. Maximum verdict is `degraded-pass` when any channel ran as compensating instead of real.
|
|
197
|
+
5. When both external channels are unavailable, note "All findings are single-model (Claude only). External validation was unavailable." in the review summary.
|
|
198
|
+
6. Never silently drop unavailable channels — always record the channel status and compensating coverage label in the review output.
|
|
202
199
|
|
|
203
|
-
|
|
200
|
+
**Superpowers channel exception:** Superpowers is a Claude subagent and requires no external CLI or auth. It is always available as long as the Superpowers plugin is installed in the Claude Code environment. If the plugin is not installed, run available external CLIs and warn the user that review coverage is reduced — but do not run a compensating pass for Superpowers (the compensating-pass mechanism only applies to external CLIs that have an installation/auth gate).
|
package/content/knowledge/core/multi-model-research-dispatch.md
ADDED
|
@@ -0,0 +1,219 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: multi-model-research-dispatch
|
|
3
|
+
description: Patterns for dispatching research and adversarial challenge to external AI models (Codex, Gemini) with reconciliation rules and single-model fallback
|
|
4
|
+
topics: [multi-model, research, competitive-analysis, red-team, codex, gemini, dispatch, reconciliation]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
# Multi-Model Research Dispatch
|
|
8
|
+
|
|
9
|
+
At higher methodology depths (4+), idea exploration and adversarial challenge benefit from independent research by external AI models. This entry provides dispatch patterns, reconciliation rules, and fallback strategies for research and red-team workflows.
|
|
10
|
+
|
|
11
|
+
## Summary
|
|
12
|
+
|
|
13
|
+
### When to Dispatch
|
|
14
|
+
| Depth | Research Dispatch | Challenge Dispatch |
|
|
15
|
+
|-------|-------------------|-------------------|
|
|
16
|
+
| 1-3 | Skip | Skip |
|
|
17
|
+
| 4 | 1 external model | 1 external model |
|
|
18
|
+
| 5 | Multi-model with reconciliation | Multi-model with reconciliation |
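The table above amounts to a three-way branch on depth. A minimal sketch, assuming the same plan applies to both research and challenge dispatch as the table shows; the `DispatchPlan` labels and function name are hypothetical.

```typescript
type DispatchPlan = "skip" | "single-external-model" | "multi-model-with-reconciliation";

// Depth 1-3 skips dispatch, depth 4 uses one external model,
// depth 5 uses multiple models with reconciliation.
function dispatchPlanForDepth(depth: number): DispatchPlan {
  if (depth >= 5) return "multi-model-with-reconciliation";
  if (depth === 4) return "single-external-model";
  return "skip";
}
```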
|
|
19
|
+
|
|
20
|
+
### Graceful Fallback Chain
|
|
21
|
+
1. Check if external CLI is available (`command -v codex`, `command -v gemini`)
|
|
22
|
+
2. If not installed, skip that model silently — note in Session Metadata
|
|
23
|
+
3. If installed, check auth (`codex login status`, `NO_BROWSER=true gemini -p "respond with ok" -o json`)
|
|
24
|
+
4. If auth fails, surface loudly to the user with `!` recovery command — do NOT silently skip
|
|
25
|
+
5. If auth succeeds, dispatch with timeout
|
|
26
|
+
6. If no external models available, fall back to primary model with distinct framing prompts
|
|
27
|
+
7. Never block the session waiting for unavailable tools
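The fallback chain can be sketched as a function over the result of the availability checks. This is an illustration under assumptions: the `Availability` values and the `FallbackDecision` shape are hypothetical, and the checks themselves are assumed to have run already. The recovery commands are the ones this entry lists for each tool.

```typescript
type ExternalModel = "Codex" | "Gemini";
type Availability = "ready" | "not_installed" | "auth_failed";

interface FallbackDecision {
  dispatchTo: ExternalModel[];  // models that passed both checks
  metadataNotes: string[];      // quiet notes for Session Metadata
  userWarnings: string[];       // loud messages for auth failures
  usePrimaryFallback: boolean;  // true when no external model is usable
}

// Recovery commands as given in this entry for each tool.
const RECOVERY: Record<ExternalModel, string> = {
  Codex: "codex login",
  Gemini: 'gemini -p "hello"',
};

const MODELS: ExternalModel[] = ["Codex", "Gemini"];

function resolveFallback(state: Record<ExternalModel, Availability>): FallbackDecision {
  const decision: FallbackDecision = {
    dispatchTo: [], metadataNotes: [], userWarnings: [], usePrimaryFallback: false,
  };
  for (const name of MODELS) {
    const availability = state[name];
    if (availability === "ready") {
      decision.dispatchTo.push(name);
    } else if (availability === "not_installed") {
      // Not installed: skip quietly, but record it.
      decision.metadataNotes.push(`${name}: not installed, skipped`);
    } else {
      // Auth failure: never silent, always surfaced with a recovery command.
      decision.userWarnings.push(`${name} auth expired: run "! ${RECOVERY[name]}" to re-authenticate`);
    }
  }
  decision.usePrimaryFallback = decision.dispatchTo.length === 0;
  return decision;
}
```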
|
|
28
|
+
|
|
29
|
+
### Reconciliation Rules
|
|
30
|
+
- **2+ models agree** on the same finding = **consensus** — high confidence, present as validated
|
|
31
|
+
- **Models disagree** = **divergent** — present ALL perspectives including minority views. Do NOT suppress the minority. A 2-1 split where the lone dissent flags a real risk is more valuable than a comfortable consensus.
|
|
32
|
+
- **Single model** (fallback) = skip reconciliation labels. Present findings directly without consensus/divergent framing.
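These labeling rules can be sketched as follows. The sketch assumes each model's position on a topic has already been normalized to a comparable string, so that equal strings mean agreement. `ModelClaim` and `labelResearchFinding` are hypothetical names.

```typescript
type ResearchLabel = "consensus" | "divergent" | "unlabeled";

interface ModelClaim {
  model: string;
  position: string;  // normalized summary of the model's view (an assumption)
}

function unique(xs: string[]): string[] {
  return xs.filter((x, i) => xs.indexOf(x) === i);
}

function labelResearchFinding(claims: ModelClaim[]): { label: ResearchLabel; perspectives: string[] } {
  const models = unique(claims.map((c) => c.model));
  const perspectives = unique(claims.map((c) => c.position));
  // Single-model fallback: present findings without reconciliation labels.
  if (models.length < 2) return { label: "unlabeled", perspectives };
  // All models share one position: consensus.
  if (perspectives.length === 1) return { label: "consensus", perspectives };
  // Any disagreement: keep every perspective, including the minority view.
  return { label: "divergent", perspectives };
}
```

A 2-1 split therefore comes back as `divergent` with both positions preserved.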
|
|
33
|
+
|
|
34
|
+
## Deep Guidance
|
|
35
|
+
|
|
36
|
+
### CLI Availability Check
|
|
37
|
+
|
|
38
|
+
Before dispatching, verify that each CLI tool is installed and authenticated. The commands for both checks are in the code block under Foreground-Only Execution below.
|
|
39
|
+
|
|
40
|
+
### Foreground-Only Execution
|
|
41
|
+
|
|
42
|
+
When an AI agent dispatches research or challenge prompts via a tool runner, always run commands in the foreground. Background execution (`run_in_background`, `&`, `nohup`) produces empty or truncated output from Codex and Gemini CLIs. Multiple foreground calls can still run in parallel if the tool runner supports parallel tool invocations.
|
|
43
|
+
|
|
44
|
+
```bash
|
|
45
|
+
# Codex CLI — step 1: check installed
|
|
46
|
+
command -v codex >/dev/null 2>&1 || { echo "Codex not installed — skipping"; exit 0; }
|
|
47
|
+
# step 2: check auth
|
|
48
|
+
codex login status 2>/dev/null
|
|
49
|
+
# Exit 0 = ready. Non-zero = auth failure (surface to user).
|
|
50
|
+
|
|
51
|
+
# Gemini CLI — step 1: check installed
|
|
52
|
+
command -v gemini >/dev/null 2>&1 || { echo "Gemini not installed — skipping"; exit 0; }
|
|
53
|
+
# step 2: check auth
|
|
54
|
+
NO_BROWSER=true gemini -p "respond with ok" -o json 2>&1
|
|
55
|
+
# Check for "ok" in response. Exit 41 = auth failure (surface to user).
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
Two distinct failure modes:
|
|
59
|
+
- **Not installed** (`command -v` fails): skip silently, note in Session Metadata
|
|
60
|
+
- **Auth failed** (non-zero after install check): surface loudly — tell the user which tool failed and how to fix it:
|
|
61
|
+
  - Codex: "Codex auth expired — run `! codex login` to re-authenticate"
|
|
62
|
+
  - Gemini: "Gemini auth expired — run `! gemini -p \"hello\"` to re-authenticate"
|
|
63
|
+
|
|
64
|
+
Auth failures are NOT silent fallbacks — surface them explicitly.
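
The two failure modes above can be told apart with a small helper. This is a minimal sketch, not the package's implementation: `check_tool` is a hypothetical name, and it assumes the auth probe's exit status alone is enough to decide (the Gemini check above also inspects the response text). Unlike the `exit 0` form, it returns a label so the caller can keep running.

```bash
# Classify one CLI as "missing", "ready", or "auth_failed".
# $1 = binary name; remaining args = the auth probe command for that tool.
# check_tool is a hypothetical helper, not part of Scaffold.
check_tool() {
  local tool="$1"; shift
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "missing"        # not installed: skip quietly, note in Session Metadata
    return 0
  fi
  if "$@" >/dev/null 2>&1; then
    echo "ready"
  else
    echo "auth_failed"    # installed but probe failed: surface loudly to the user
  fi
}

# Demonstration with commands that exist everywhere:
check_tool sh true                  # prints "ready"
check_tool sh false                 # prints "auth_failed"
check_tool no-such-cli-xyz true     # prints "missing"
```

With the real tools, the calls would look like `check_tool codex codex login status` and `check_tool gemini env NO_BROWSER=true gemini -p "respond with ok" -o json`.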
|
|
65
|
+
|
|
66
|
+
### Timeout Handling
|
|
67
|
+
|
|
68
|
+
| Dispatch type | Timeout |
|
|
69
|
+
|---------------|---------|
|
|
70
|
+
| Research dispatch (idea summary + questions) | 120 seconds |
|
|
71
|
+
| Challenge dispatch (full brief review) | 180 seconds |
|
|
72
|
+
|
|
73
|
+
If a dispatch times out:
|
|
74
|
+
- Use whatever partial response was received (if parseable)
|
|
75
|
+
- Note the timeout in Session Metadata
|
|
76
|
+
- Do NOT retry — proceed with available data
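
One way to apply these limits while staying in the foreground is to wrap the dispatch in the `timeout` utility. The sketch below assumes GNU coreutils `timeout` is on the PATH (it is not installed by default on macOS) and relies on its documented exit status 124 for an expired limit. `dispatch_with_timeout` and the two variable names are illustrative, not part of Scaffold.

```bash
RESEARCH_TIMEOUT=120     # seconds, from the table above
CHALLENGE_TIMEOUT=180

# Run a command in the foreground with a time limit, keeping partial output.
# Prints whatever the command wrote; returns 124 if the limit was hit.
# Assumes GNU coreutils `timeout`; dispatch_with_timeout is a hypothetical name.
dispatch_with_timeout() {
  local limit="$1"; shift
  local output status
  output="$(timeout "$limit" "$@" 2>&1)"
  status=$?
  printf '%s\n' "$output"
  if [ "$status" -eq 124 ]; then
    # Record the timeout for Session Metadata; do not retry.
    echo "note: dispatch timed out after ${limit}s; using partial output" >&2
  fi
  return "$status"
}
```

A research dispatch would then be `dispatch_with_timeout "$RESEARCH_TIMEOUT" codex exec --skip-git-repo-check -s read-only --ephemeral "$RESEARCH_PROMPT"`, with the caller deciding whether the partial output is parseable.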
|
|
77
|
+
|
|
78
|
+
### Research Dispatch Mode
|
|
79
|
+
|
|
80
|
+
**When**: Phase 2 at depth 4-5.
|
|
81
|
+
|
|
82
|
+
**Prompt template for external model:**
|
|
83
|
+
|
|
84
|
+
```
|
|
85
|
+
You are conducting independent competitive research for a product idea.
|
|
86
|
+
|
|
87
|
+
IDEA: [1-2 sentence summary of the idea from Phase 1]
|
|
88
|
+
|
|
89
|
+
RESEARCH QUESTIONS:
|
|
90
|
+
1. What are the direct competitors in this space? For each, note what they do well and where they fall short.
|
|
91
|
+
2. What indirect alternatives exist — different approaches to the same problem?
|
|
92
|
+
3. How do users currently cope without a dedicated solution?
|
|
93
|
+
4. What recent market signals exist — funding rounds, product launches, shutdowns, regulatory changes?
|
|
94
|
+
5. What adjacent markets or analogous systems could inform this idea?
|
|
95
|
+
|
|
96
|
+
Be thorough and honest. Acknowledge competitor strengths — do not dismiss them.
|
|
97
|
+
Respond in structured markdown with one section per question.
|
|
98
|
+
```
|
|
99
|
+
|
|
100
|
+
**Execution:**
|
|
101
|
+
|
|
102
|
+
```bash
|
|
103
|
+
# Codex
|
|
104
|
+
codex exec --skip-git-repo-check -s read-only --ephemeral "RESEARCH_PROMPT" 2>&1
|
|
105
|
+
|
|
106
|
+
# Gemini
|
|
107
|
+
NO_BROWSER=true gemini -p "RESEARCH_PROMPT" --output-format json --approval-mode yolo 2>/dev/null
|
|
108
|
+
```
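
The `RESEARCH_PROMPT` placeholder in the commands above stands for the filled-in template. A minimal sketch of filling it, assuming a hypothetical helper `build_research_prompt` and a made-up idea summary:

```bash
# Fill the research template with an idea summary.
# build_research_prompt is a hypothetical helper, not part of Scaffold.
build_research_prompt() {
  local idea="$1"
  cat <<EOF
You are conducting independent competitive research for a product idea.

IDEA: ${idea}

RESEARCH QUESTIONS:
1. What are the direct competitors in this space? For each, note what they do well and where they fall short.
2. What indirect alternatives exist — different approaches to the same problem?
3. How do users currently cope without a dedicated solution?
4. What recent market signals exist — funding rounds, product launches, shutdowns, regulatory changes?
5. What adjacent markets or analogous systems could inform this idea?

Be thorough and honest. Acknowledge competitor strengths — do not dismiss them.
Respond in structured markdown with one section per question.
EOF
}

# Example idea text is invented for illustration.
RESEARCH_PROMPT="$(build_research_prompt "A habit tracker designed for shift workers")"
```

Passing the variable quoted, as in `codex exec ... "$RESEARCH_PROMPT"`, keeps the multi-line prompt as a single argument.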
|
|
109
|
+
|
|
110
|
+
**Processing results:**
|
|
111
|
+
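- Parse the response as structured markdown (Gemini's reply arrives wrapped in JSON because of `--output-format json`, so extract the response text first)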
|
|
112
|
+
- Extract key findings per research question
|
|
113
|
+
- If multi-model (depth 5), run reconciliation (see below)
|
|
114
|
+
- Present findings to the user conversationally, not as raw output
|
|
115
|
+
|
|
116
|
+
### Challenge Dispatch Mode (Red-Team)
|
|
117
|
+
|
|
118
|
+
**When**: Phase 6 at depth 4-5.
|
|
119
|
+
|
|
120
|
+
**Prompt template for external model:**
|
|
121
|
+
|
|
122
|
+
```
|
|
123
|
+
You are an adversarial reviewer stress-testing a product idea brief.
|
|
124
|
+
Your job is to find weaknesses, challenge assumptions, and surface missed opportunities.
|
|
125
|
+
|
|
126
|
+
SPARK BRIEF:
|
|
127
|
+
[Full content of the draft spark-brief.md]
|
|
128
|
+
|
|
129
|
+
CHALLENGE INSTRUCTIONS:
|
|
130
|
+
1. For each section, identify the weakest assumption and explain why it might be wrong.
|
|
131
|
+
2. What competitors or market dynamics does the brief underestimate?
|
|
132
|
+
3. What technical feasibility risks are glossed over?
|
|
133
|
+
4. What user segments or use cases are missing?
|
|
134
|
+
5. If you could only flag ONE critical risk, what would it be?
|
|
135
|
+
|
|
136
|
+
Be constructive but ruthless. The goal is to strengthen the idea, not validate it.
|
|
137
|
+
Respond in structured markdown with one section per challenge area.
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
**Processing results:**
|
|
141
|
+
- Parse the challenges from the response
|
|
142
|
+
- Present each challenge to the user one at a time
|
|
143
|
+
- For each challenge, ask: "Accept (update the brief), dismiss (explain why it's not applicable), or defer (note as open question)?"
|
|
144
|
+
- Track dispositions and update the brief accordingly
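
Disposition tracking can be as simple as an append-only log. The sketch below is one possible shape, under the assumption that a tab-separated file is an acceptable store; both function names and the file layout are hypothetical.

```bash
# Validate and append one disposition to a tab-separated log.
# $1 = log file, $2 = challenge title, $3 = accept | dismiss | defer, $4 = note
# Both helpers are hypothetical, not part of Scaffold.
record_disposition() {
  local log="$1" title="$2" choice="$3" note="${4:-}"
  case "$choice" in
    accept|dismiss|defer) ;;
    *) echo "unknown disposition: $choice" >&2; return 1 ;;
  esac
  printf '%s\t%s\t%s\n' "$choice" "$title" "$note" >> "$log"
}

# Count how many challenges received each disposition (one "name=count" per line).
summarize_dispositions() {
  cut -f1 "$1" | sort | uniq -c | awk '{ print $2 "=" $1 }'
}
```

Rejecting anything other than the three allowed answers keeps the log consistent with the question put to the user.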
|
|
145
|
+
|
|
146
|
+
### Single-Model Fallback
|
|
147
|
+
|
|
148
|
+
When no external models are available, the primary model simulates multiple perspectives:
|
|
149
|
+
|
|
150
|
+
**Perspective 1 — Venture Capitalist**: "Analyze this idea as a VC evaluating a pitch. What's the market size? What's the defensibility? What are the unit economics? Would you invest?"
|
|
151
|
+
|
|
152
|
+
**Perspective 2 — Competitor's Product Lead**: "You're the product lead at [biggest competitor]. You just learned about this idea. What's your reaction? What would you do to defend your position? What aspects worry you?"
|
|
153
|
+
|
|
154
|
+
**Perspective 3 — Skeptical End User**: "You're a potential user who has tried and abandoned 3 similar products. What would make you try this one? What would make you abandon it after a week? What's the one thing that would keep you?"
|
|
155
|
+
|
|
156
|
+
Run each perspective as a separate reasoning pass. Synthesize the three viewpoints into findings the user can act on.
|
|
157
|
+
|
|
158
|
+
### Model Selection
|
|
159
|
+
|
|
160
|
+
| Task | Recommended model | Rationale |
|
|
161
|
+
|------|-------------------|-----------|
|
|
162
|
+
| Research dispatch | Either Codex or Gemini | Both capable of web-informed reasoning |
|
|
163
|
+
| Challenge dispatch | Either Codex or Gemini | Adversarial analysis is model-agnostic |
|
|
164
|
+
| Depth 4 (1 model) | Prefer Gemini (Google search built-in) | Strongest for competitive research |
|
|
165
|
+
| Depth 5 (multi) | Both Codex AND Gemini | Diverse perspectives from different architectures |
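
The table can be read as a small decision function. This sketch is illustrative only: `select_models` is a hypothetical name, and two behaviours are assumptions not stated in the table, namely that nothing is dispatched below depth 4 and that depth 5 proceeds with whichever single model is available.

```bash
# Print the models to dispatch to, one per line.
# Prints "fallback" if none are ready and "none" below depth 4.
# $1 = depth (1-5), $2 = 1 if Codex is ready, $3 = 1 if Gemini is ready
# select_models is a hypothetical helper, not part of Scaffold.
select_models() {
  local depth="$1" codex_ok="$2" gemini_ok="$3"
  if [ "$depth" -lt 4 ]; then
    echo "none"; return 0            # assumption: no external dispatch below depth 4
  fi
  if [ "$codex_ok" -ne 1 ] && [ "$gemini_ok" -ne 1 ]; then
    echo "fallback"; return 0        # single-model fallback
  fi
  if [ "$depth" -eq 4 ]; then
    # One model only: prefer Gemini, otherwise Codex.
    if [ "$gemini_ok" -eq 1 ]; then echo "gemini"; else echo "codex"; fi
    return 0
  fi
  # Depth 5: every model that is ready.
  [ "$codex_ok" -eq 1 ] && echo "codex"
  [ "$gemini_ok" -eq 1 ] && echo "gemini"
  return 0
}
```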
|
|
166
|
+
|
|
167
|
+
### Reconciliation Process (Depth 5)
|
|
168
|
+
|
|
169
|
+
When two or more models return research findings, reconcile them:
|
|
170
|
+
|
|
171
|
+
1. **Extract findings**: Parse each model's response into discrete findings (one competitor, one market signal, one risk = one finding).
|
|
172
|
+
2. **Match findings**: Compare findings across models. Two findings match if they reference the same entity (competitor, trend, risk) even if the wording differs.
|
|
173
|
+
3. **Classify each finding**:
|
|
174
|
+
   - **Consensus**: 2+ models independently identified the same finding. High confidence.
|
|
175
|
+
   - **Divergent**: Models disagree about the same entity (e.g., one says competitor X is strong, another says X is weak). Present both perspectives with reasoning.
|
|
176
|
+
   - **Unique**: Only one model surfaced this finding. Not necessarily wrong — may be the most valuable insight. Present it without discounting.
|
|
177
|
+
4. **Synthesize for the user**: Present findings grouped by classification. Lead with consensus (highest confidence), then unique (potential insights), then divergent (needs user judgment).
|
|
178
|
+
5. **Never suppress minority views**: A lone model flagging a risk that others missed may be the most important finding in the entire research pass.
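
The classification step can be sketched mechanically once findings have been extracted. The example assumes a hypothetical input format, one finding per line as `model<TAB>entity<TAB>stance`, and matches entities by case-insensitive name only. In practice, matching findings whose wording differs takes judgment that a string comparison does not capture.

```bash
# Label each entity as consensus, divergent, or unique.
# Input on stdin: model<TAB>entity<TAB>stance, one finding per line.
# The input format and function name are assumptions for illustration.
classify_findings() {
  awk -F'\t' '
    {
      entity = tolower($2)
      if (!(entity in first_seen)) { first_seen[entity] = ++n; names[n] = entity }
      if (!((entity, $1) in by_model))  { by_model[entity, $1] = 1;  models[entity]++ }
      if (!((entity, $3) in by_stance)) { by_stance[entity, $3] = 1; stances[entity]++ }
    }
    END {
      for (i = 1; i <= n; i++) {
        e = names[i]
        if (models[e] < 2)        label = "unique"      # only one model raised it
        else if (stances[e] > 1)  label = "divergent"   # models disagree
        else                      label = "consensus"   # 2+ models, same stance
        printf "%s\t%s\n", label, e
      }
    }'
}
```

Every entity gets exactly one label and none is dropped, so a finding raised by a single model still reaches the user.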
|
|
179
|
+
|
|
180
|
+
### Quality Gates
|
|
181
|
+
|
|
182
|
+
Before presenting research findings to the user, verify:
|
|
183
|
+
|
|
184
|
+
- At least 2 competitors or alternatives identified (even at depth 4 with a single model)
|
|
185
|
+
- Each competitor has both a strength and a weakness documented
|
|
186
|
+
- The "do nothing" option is addressed (how users cope without any tool)
|
|
187
|
+
- Market timing signals are present (why now?)
|
|
188
|
+
- If multi-model: reconciliation labels (consensus/divergent/unique) are applied
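
The first two gates are mechanical enough to check in a script. This is a sketch under stated assumptions: competitor rows are available as `name<TAB>strength<TAB>weakness`, a format invented here, and `competitor_gate` is a hypothetical name. The remaining gates need human or model judgment and are not covered.

```bash
# Read competitor rows (name<TAB>strength<TAB>weakness) on stdin.
# Succeeds only if there are 2+ rows and none lacks a strength or a weakness.
# competitor_gate and the row format are assumptions for illustration.
competitor_gate() {
  awk -F'\t' '
    NF > 0 { rows++; if ($2 == "" || $3 == "") incomplete++ }
    END    { exit !(rows >= 2 && incomplete == 0) }'
}
```

A non-zero exit status signals that the findings need more research before they are presented.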
|
|
189
|
+
|
|
190
|
+
### Common Anti-Patterns
|
|
191
|
+
|
|
192
|
+
| Anti-pattern | Problem | Fix |
|
|
193
|
+
|-------------|---------|-----|
|
|
194
|
+
| Dismissing competitors | "They're not really competition" — every alternative is competition | Acknowledge strengths honestly |
|
|
195
|
+
| Echo chamber | Both models agree because both drew from the same training data | Look for unique findings, not just consensus |
|
|
196
|
+
| Recency bias | Focusing only on recent launches, ignoring established players | Include both established and emerging competitors |
|
|
197
|
+
| Feature-list comparison | Comparing feature lists instead of positioning | Compare on audience, value prop, and differentiation |
|
|
198
|
+
| Silent fallback | External model fails, no mention in output | Always note which models were used and any failures |
|
|
199
|
+
| Over-synthesis | Merging distinct findings into one summary, losing nuance | Preserve individual findings before synthesizing |
|
|
200
|
+
|
|
201
|
+
### Output Format
|
|
202
|
+
|
|
203
|
+
When presenting research findings to the user, structure them as:
|
|
204
|
+
|
|
205
|
+
**Competitive Landscape:**
|
|
206
|
+
- [Competitor 1]: Strengths — [specifics]. Weaknesses — [specifics]. Why users choose them — [specifics].
|
|
207
|
+
- [Competitor 2]: ...
|
|
208
|
+
- "Do nothing" option: How users cope today — [specifics]. Why it's insufficient — [specifics].
|
|
209
|
+
|
|
210
|
+
**Market Signals:**
|
|
211
|
+
- [Signal 1]: [What happened, when, why it matters for this idea]
|
|
212
|
+
- [Signal 2]: ...
|
|
213
|
+
|
|
214
|
+
**Expansion Opportunities** (from adjacent market research):
|
|
215
|
+
- [Opportunity 1]: [What it is, why it's relevant, how it connects]
|
|
216
|
+
|
|
217
|
+
**Red-Team Challenges** (from adversarial review):
|
|
218
|
+
- [Challenge 1]: [Weakness identified, why it matters, recommended action]
|
|
219
|
+
  - Disposition: [accept/dismiss/defer — tracked after user response]
|