devlyn-cli 1.9.1 → 1.11.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +3 -1
- package/config/skills/devlyn:auto-resolve/SKILL.md +52 -2
- package/config/skills/devlyn:ideate/SKILL.md +69 -4
- package/config/skills/devlyn:ideate/references/challenge-rubric.md +122 -0
- package/config/skills/devlyn:ideate/references/codex-debate.md +109 -0
- package/package.json +5 -1
package/CLAUDE.md
CHANGED
@@ -56,10 +56,12 @@ For hands-free build-evaluate-polish cycles — works for bugs, features, refact
 /devlyn:auto-resolve [task description]
 ```
 
-This runs the full pipeline automatically: **Build → Build Gate → Browser Validate → Evaluate → Fix Loop → Simplify → Review → Security Review → Clean → Docs**. Each phase runs as a separate subagent with its own context. Communication between phases happens via files (`.devlyn/done-criteria.md`, `.devlyn/BUILD-GATE.md`, `.devlyn/EVAL-FINDINGS.md`, `.devlyn/BROWSER-RESULTS.md`).
+This runs the full pipeline automatically: **Build → Build Gate → Browser Validate → Evaluate → Fix Loop → Simplify → Review → Challenge → Security Review → Clean → Docs**. Each phase runs as a separate subagent with its own context. Communication between phases happens via files (`.devlyn/done-criteria.md`, `.devlyn/BUILD-GATE.md`, `.devlyn/EVAL-FINDINGS.md`, `.devlyn/BROWSER-RESULTS.md`, `.devlyn/CHALLENGE-FINDINGS.md`).
 
 The **Build Gate** (Phase 1.4) runs real compilers, typecheckers, and linters — the same commands CI/Docker/production will run. It auto-detects project types (Next.js, Rust, Go, Solidity, Expo, Swift, etc.) and Dockerfiles. This is the primary defense against the "tests pass locally, breaks in CI/Docker" class of bugs (type errors in un-tested files, cross-package drift, Dockerfile copy mismatches).
 
+The **Challenge** phase (Phase 4.5) is a fresh skeptical review with no checklist — a subagent reads the entire diff cold, with zero context from prior phases, and asks "would I ship this to production with my name on it?" This catches the subtle issues that structured, checklist-driven reviews miss: wrong-but-working approaches, unstated assumptions, non-idiomatic patterns, and integration gaps.
+
 For web projects, the Browser Validate phase starts the dev server and tests the implemented feature in a real browser — clicking buttons, filling forms, verifying results. If the feature doesn't work, findings feed back into the fix loop.
 
 Optional flags:

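The file-based handoff above is the load-bearing design choice: each phase runs as a fresh subagent, so state must survive on disk rather than in context. A minimal TypeScript sketch of that contract, assuming a Node-style orchestrator; the `.devlyn/` filenames come from the text above, the helper names are hypothetical:

```typescript
import { mkdir, readFile, writeFile } from "node:fs/promises";

const DEVLYN_DIR = ".devlyn";

// A phase publishes its result for the next subagent to pick up.
async function writePhaseResult(file: string, content: string): Promise<void> {
  await mkdir(DEVLYN_DIR, { recursive: true });
  await writeFile(`${DEVLYN_DIR}/${file}`, content, "utf8");
}

// A later phase reads only what it needs; a missing file means the
// producing phase was skipped (or has not run yet).
async function readPhaseResult(file: string): Promise<string | null> {
  try {
    return await readFile(`${DEVLYN_DIR}/${file}`, "utf8");
  } catch {
    return null;
  }
}

// e.g. the Challenge phase writes its findings, the orchestrator reads them back:
await writePhaseResult("CHALLENGE-FINDINGS.md", "# Challenge Findings\n## Verdict: PASS\n");
const challenge = await readPhaseResult("CHALLENGE-FINDINGS.md");
```
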
package/config/skills/devlyn:auto-resolve/SKILL.md
CHANGED
@@ -45,7 +45,7 @@ This pipeline runs hands-free. The user launches it to walk away and come back t
 ```
 Auto-resolve pipeline starting
 Task: [extracted task description]
-Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed] → Simplify → [Review] → [Security] → [Clean] → [Docs]
+Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed] → Simplify → [Review] → Challenge → [Security] → [Clean] → [Docs]
 Max evaluation rounds: [N]
 Cross-model evaluation (Codex): [evaluate / review / both / disabled]
 ```

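The banner echoes the parsed run configuration back to the user before anything executes. A hypothetical sketch of the options object behind it; the Codex enum values come from the banner text itself:

```typescript
// Hypothetical shape of the parsed pipeline options.
interface PipelineOptions {
  task: string;
  maxRounds: number;
  codexMode: "evaluate" | "review" | "both" | "disabled";
}

function banner(o: PipelineOptions): string {
  return [
    "Auto-resolve pipeline starting",
    `Task: ${o.task}`,
    "Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed] → Simplify → [Review] → Challenge → [Security] → [Clean] → [Docs]",
    `Max evaluation rounds: ${o.maxRounds}`,
    `Cross-model evaluation (Codex): ${o.codexMode}`,
  ].join("\n");
}
```
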
@@ -278,6 +278,55 @@ Clean up the team after completion.
 1. If CRITICAL issues remain unfixed, log a warning in the final report
 2. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): review fixes complete"` if there are changes
 
+## PHASE 4.5: CHALLENGE
+
+Every prior phase used checklists, done-criteria, or structured categories. This phase is deliberately different — it's a fresh pair of eyes with no checklist, no prior context, and a skeptical mandate. The subagent hasn't seen the done-criteria, the eval findings, or the review results. It reads the raw diff cold and asks: "would I ship this?"
+
+This is what catches the things structured reviews miss — subtle logic that technically works but isn't the right approach, assumptions nobody questioned, patterns that are fine but not best practice, and integration seams that look correct in isolation but feel wrong when you read the whole changeset.
+
+Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
+
+Agent prompt — pass this to the Agent tool:
+
+You are a senior engineer doing a final skeptical review before this code ships to production. You have NOT seen any prior reviews, test results, or design docs — read the code cold.
+
+Run `git diff main` to see all changes. Read every changed file in full (not just the diff hunks — you need the surrounding context).
+
+Your job is NOT to check boxes. Your job is to find the things that would make a staff engineer say "hold on, let's talk about this before we ship." Think about:
+
+- Would this approach survive a 10x traffic spike? A midnight on-call page? A junior dev maintaining it 6 months from now?
+- Are there assumptions baked in that nobody stated out loud? Hardcoded limits, implicit ordering, missing edge cases in business logic?
+- Is the error handling actually helpful, or does it just prevent crashes while leaving the user confused?
+- Are there simpler, more idiomatic ways to do what this code does? Not "clever" alternatives — genuinely better approaches?
+- Would you approve this PR with confidence, or would you leave comments?
+
+Be brutally honest. Do NOT start with praise. Do NOT soften findings. Every finding must include `file:line` and a concrete fix — not "consider improving" but "change X to Y because Z."
+
+Write `.devlyn/CHALLENGE-FINDINGS.md`:
+
+```
+# Challenge Findings
+## Verdict: [PASS / NEEDS WORK]
+## Findings
+### [severity: CRITICAL / HIGH / MEDIUM]
+- `file:line` — what's wrong — Fix: concrete change
+```
+
+Verdict: PASS only if you would confidently ship this code with your name on it. If you found anything CRITICAL or HIGH, the verdict is NEEDS WORK.
+
+**After the agent completes**:
+1. Read `.devlyn/CHALLENGE-FINDINGS.md`
+2. Extract the verdict
+3. Branch:
+   - `PASS` → continue to PHASE 5
+   - `NEEDS WORK` → spawn a fix subagent with `mode: "bypassPermissions"`:
+
+Read `.devlyn/CHALLENGE-FINDINGS.md` — it contains findings from a fresh skeptical review. Fix every CRITICAL and HIGH finding at the root cause. For MEDIUM findings, fix if straightforward. After fixing, run the test suite to verify nothing broke.
+
+After the fix agent completes:
+1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): challenge fixes complete"`
+2. Continue to PHASE 5 (do NOT re-run the challenge — one pass only, to avoid infinite loops)
+
 ## PHASE 5: SECURITY REVIEW (conditional)
 
 Determine whether to run this phase:

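The "read, extract, branch" steps are mechanical enough to pin down. A TypeScript sketch, assuming a Node-style orchestrator; `spawnFixSubagent` and `checkpoint` are hypothetical stand-ins for the Agent-tool call and the git checkpoint:

```typescript
import { readFile } from "node:fs/promises";

type Verdict = "PASS" | "NEEDS WORK";

// Hypothetical stubs for the two follow-up actions described above.
async function spawnFixSubagent(): Promise<void> { /* Agent tool, bypassPermissions */ }
async function checkpoint(message: string): Promise<void> { /* git add -A && git commit */ }

// The findings template pins the verdict line's shape, so one anchored
// regex is enough to extract it.
async function readChallengeVerdict(): Promise<Verdict> {
  const text = await readFile(".devlyn/CHALLENGE-FINDINGS.md", "utf8");
  const m = text.match(/^## Verdict:\s*(PASS|NEEDS WORK)\s*$/m);
  if (!m) throw new Error("no recognizable verdict in CHALLENGE-FINDINGS.md");
  return m[1] as Verdict;
}

// Branch once: fix on NEEDS WORK, never re-run the challenge, continue to PHASE 5.
if ((await readChallengeVerdict()) === "NEEDS WORK") {
  await spawnFixSubagent();
  await checkpoint("chore(pipeline): challenge fixes complete");
}
```
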
@@ -343,7 +392,7 @@ Synchronize documentation with recent code changes. Use `git log --oneline -20`
 After all phases complete:
 
 1. Clean up temporary files:
-   - Delete the `.devlyn/` directory entirely (contains done-criteria.md, BUILD-GATE.md, EVAL-FINDINGS.md, BROWSER-RESULTS.md, screenshots/, playwright temp files)
+   - Delete the `.devlyn/` directory entirely (contains done-criteria.md, BUILD-GATE.md, EVAL-FINDINGS.md, BROWSER-RESULTS.md, CHALLENGE-FINDINGS.md, screenshots/, playwright temp files)
    - Kill any dev server process still running from browser validation
 
 2. Run `git log --oneline -10` to show commits made during the pipeline

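Both cleanup items reduce to a few lines. A sketch, assuming the orchestrator kept the dev-server PID from the browser-validation phase; the function name is hypothetical:

```typescript
import { rm } from "node:fs/promises";

async function cleanupPipeline(devServerPid?: number): Promise<void> {
  // Remove .devlyn/ and everything under it (findings files, screenshots, temp files).
  await rm(".devlyn", { recursive: true, force: true });

  // Kill a dev server left over from browser validation, if its PID is known.
  if (devServerPid !== undefined) {
    try {
      process.kill(devServerPid, "SIGTERM");
    } catch {
      // process already exited; nothing to do
    }
  }
}
```
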
@@ -367,6 +416,7 @@ After all phases complete:
 | Simplify | [completed / skipped] | [changes made] |
 | Review (Claude team) | [completed / skipped] | [findings summary] |
 | Review (Codex) | [completed / skipped] | [Codex-only findings, agreed findings] |
+| Challenge | [PASS / NEEDS WORK] | [findings count, fixes applied] |
 | Security review | [completed / skipped / auto-skipped] | [findings or "no security-sensitive changes"] |
 | Clean | [completed / skipped] | [items cleaned] |
 | Docs (update-docs) | [completed / skipped] | [docs updated] |

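Each pipeline phase reports through the same three-column table contract, which keeps the final report renderable from a flat list. A hypothetical sketch:

```typescript
interface PhaseRow {
  phase: string;
  status: string; // e.g. "completed", "skipped", or the Challenge verdict
  detail: string;
}

// Render one markdown row of the final status table.
const row = (r: PhaseRow): string => `| ${r.phase} | ${r.status} | ${r.detail} |`;

row({ phase: "Challenge", status: "NEEDS WORK", detail: "3 findings, 3 fixes applied" });
// → "| Challenge | NEEDS WORK | 3 findings, 3 fixes applied |"
```
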
package/config/skills/devlyn:ideate/SKILL.md
CHANGED
@@ -1,6 +1,6 @@
 ---
 name: devlyn:ideate
-description:
+description: Transforms unstructured ideas into implementation-ready planning documents through structured brainstorming, research, and a built-in self-skeptical rubric pass. Produces a three-layer document architecture (Vision, Roadmap index, auto-resolve-ready specs) to eliminate context pollution in the implementation pipeline. Optional --with-codex flag adds OpenAI Codex as a cross-model critic. Use when the user wants to brainstorm, plan a new project or feature set, create a vision and roadmap, or structure scattered ideas into an actionable plan. Triggers on "let's brainstorm", "let's plan", "ideate", "I have an idea for", "help me think through", "let's explore", new project planning, feature discovery, roadmap creation, or when the user is throwing out ideas that need structuring.
 ---
 
 # Ideation to Implementation Bridge

@@ -20,6 +20,14 @@ Concretely:
 - If you catch yourself about to open a source file to make a code change, stop — that's a signal you've left ideation mode
 </hard_boundary>
 
+## Arguments
+
+Parse these from the user's invocation message:
+
+- `--with-codex` (default: off) — bare flag. When set, OpenAI Codex runs an independent rubric pass during Phase 3.5 CHALLENGE via `mcp__codex-cli__*` MCP tools, using the same rubric as the solo pass. Codex always runs at `reasoningEffort: "xhigh"` — the entire point of the flag is maximum reasoning from a second model family.
+
+**If `--with-codex` is set**: read `references/challenge-rubric.md` and `references/codex-debate.md` up front, then run the pre-flight check described in `codex-debate.md` to verify the Codex MCP server is available before starting the pipeline. If the server is unavailable and the user opts to continue without Codex, the solo CHALLENGE pass still runs — only the cross-model rubric pass is disabled.
+
 <why_this_matters>
 When ideas flow directly from conversation to `/devlyn:auto-resolve`, context degrades at each handoff:
 - Abstract vision statements cause over-engineering (the agent optimizes for principles instead of deliverables)

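Because `--with-codex` is a bare flag embedded in a free-form invocation message, parsing it reduces to a word-boundary check. A hypothetical sketch:

```typescript
interface IdeateArgs {
  withCodex: boolean;
}

// The flag takes no value, so a delimiter-bounded match is all that is needed.
function parseIdeateArgs(invocation: string): IdeateArgs {
  return { withCodex: /(^|\s)--with-codex(\s|$)/.test(invocation) };
}

parseIdeateArgs("let's plan the editor rewrite --with-codex"); // { withCodex: true }
```
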
@@ -271,8 +279,47 @@ Within each phase:
 ### Architecture Decisions
 Surface decisions that affect multiple items — technology choices, data model, integration approaches, UX patterns. For each: **What** was decided, **Why** (tradeoffs), and **What alternatives** were considered. These become decision records.
 
-###
-
+### Internal draft — do not show the user yet
+
+At this point you have an internal convergence draft: themes, phases, items, decisions. **Do not present it to the user yet.** Phase 3.5 CHALLENGE runs next, and the user will see exactly one summary — the post-challenge plan, with visibility into what CHALLENGE changed. Showing the pre-challenge draft first and then revising it after the challenge creates a two-round confirmation loop that burns the user's trust.
+
+## Phase 3.5: CHALLENGE
+
+<phase_goal>Apply a strict 5-axis rubric to the internal convergence draft, then present one post-challenge summary to the user for confirmation. Always runs.</phase_goal>
+
+<thinking_effort>
+Engage maximum thinking effort here — in both the solo rubric pass and, if enabled, the Codex pass. Use extended thinking ("ultrathink") when reading each item, applying each axis, and producing revisions. The default Claude failure mode in self-review is nodding along to the draft you just produced; shallow thinking here is the exact pattern this phase exists to prevent.
+
+Before finalizing the rubric pass, verify your findings against the rubric one more time: every flagged item should have a specific Quote, a failing axis, and a concrete revision — not a vague concern.
+</thinking_effort>
+
+The user has been burned by plans that look good on the surface but fall apart under scrutiny. Every time they accept a plan and then ask "is this no-workaround, no-guesswork, no-overengineering, world-class best practice, optimized?", the honest answer is almost always no. This phase makes that scrutiny the *default* behavior — the plan challenges itself before the user has to.
+
+### The rubric — single source of truth
+
+Read `references/challenge-rubric.md` before starting. That file is the only definition of the 5 axes, the finding format, the hard rule about respecting explicit user intent, and the good-vs-bad examples. Both the solo pass and the Codex pass use the same rubric; do not re-derive it inline.
+
+### Solo pass (always runs)
+
+Apply the rubric to the internal convergence draft. Produce findings in the format specified in `challenge-rubric.md` (Severity / Quote / Axis / Why / Fix).
+
+For Quick Add with one new item, one solo pass is enough. For a full greenfield or expand plan, run the rubric once, revise, and run it again on the revision. If a third pass would be needed, the plan has structural problems that belong in the user-facing summary as open questions — surface them rather than iterating further.
+
+If the plan came from one model in one pass, it almost always fails at least one axis somewhere. Nodding along to your own draft defeats the entire point of the phase.
+
+### Codex pass (only if `--with-codex` is set)
+
+If the flag is set, you have already loaded `references/codex-debate.md` during argument parsing — follow its "PHASE 3.5-CODEX" section now. Codex applies the rubric from `challenge-rubric.md` independently at `reasoningEffort: "xhigh"`. Reconcile findings as `codex-debate.md` describes — findings raised by both sides get marked "confirmed by both", and Codex-only findings get prefixed `[codex]` in internal notes so the user can see where each push came from.
+
+### Respect explicit user intent
+
+The rubric is a quality lens, not an override. If a finding conflicts with something the user explicitly and clearly asked for, follow the "Hard rule" section in `challenge-rubric.md`: record the finding, **do not silently rewrite the plan**, and surface it as an open question in the summary below. The user makes the call.
+
+### User-facing summary (the first and only time the user sees the plan)
+
+After the rubric pass(es), present the post-challenge plan to the user for confirmation. This is the first time the user sees the converged plan — by design, so they see a rubric-checked result rather than a draft that immediately gets revised.
+
+Format:
 ```
 Vision: [one sentence]
 Phases: [N] phases, [M] total items

@@ -280,9 +327,24 @@ Phase 1 ([theme]): [items with brief descriptions]
 Phase 2 ([theme]): [items]
 Key decisions: [list]
 Deferred: [items with reasons]
+
+## CHALLENGE results
+
+Solo pass: [N findings, M applied]
+Codex pass: [N findings, M applied] ← only if --with-codex was set
+
+Changes applied during CHALLENGE:
+- [item]: [what changed and which axis triggered it]
+
+Open questions for you (the rubric flagged something you explicitly asked for):
+- [item]: the rubric says [finding]; you asked for [original]; here is the tradeoff — proceed as-is, or adopt the alternative?
 ```
 
-Get explicit confirmation before proceeding to
+Get explicit confirmation before proceeding to DOCUMENT.
+
+### Quick Add mode
+
+For single-item additions, run one solo rubric pass on just the new item. Even then, do not skip it — single-item additions are exactly where overengineering and workarounds slip in unnoticed, because the lack of surrounding context makes a bad item look self-contained and harmless.
 
 ## Phase 4: DOCUMENT
 

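A sketch of assembling the CHALLENGE-results block of that summary from the two passes; the types and names are hypothetical, the output lines mirror the template above:

```typescript
interface RubricPass {
  label: "Solo pass" | "Codex pass";
  findings: number;
  applied: number;
}

function renderChallengeResults(passes: RubricPass[]): string {
  return [
    "## CHALLENGE results",
    "",
    ...passes.map((p) => `${p.label}: ${p.findings} findings, ${p.applied} applied`),
  ].join("\n");
}

renderChallengeResults([
  { label: "Solo pass", findings: 4, applied: 3 },
  { label: "Codex pass", findings: 2, applied: 2 }, // present only when --with-codex was set
]);
```
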
@@ -376,6 +438,9 @@ Before finalizing, verify:
 - [ ] No spec requires reading VISION.md to be understood (self-contained)
 - [ ] Dependencies between items are documented in both specs
 - [ ] Architecture decisions include reasoning and alternatives considered
+- [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex if `--with-codex` was set); no item still fails any axis at CRITICAL or HIGH severity
+- [ ] The user saw the post-challenge plan as the first and only confirmation prompt — no pre-challenge draft was shown first
+- [ ] Any rubric finding that conflicted with explicit user intent was surfaced as an open question, not silently applied
 
 ## Language
 

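The first new checklist item is effectively a severity gate. As a predicate it might look like this (names hypothetical):

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

interface ChecklistFinding {
  severity: Severity;
  resolved: boolean;
}

// "No item still fails any axis at CRITICAL or HIGH severity."
function challengeGatePasses(findings: ChecklistFinding[]): boolean {
  return findings.every(
    (f) => f.resolved || (f.severity !== "CRITICAL" && f.severity !== "HIGH"),
  );
}
```
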
package/config/skills/devlyn:ideate/references/challenge-rubric.md
ADDED
@@ -0,0 +1,122 @@
+# CHALLENGE Rubric (single source of truth)
+
+## Contents
+- Context — this is a planning rubric
+- The 5 axes (NO WORKAROUND, NO GUESSWORK, NO OVERENGINEERING, WORLD-CLASS BEST PRACTICE, OPTIMIZED)
+- Hard rule — respect explicit user intent
+- Finding format
+- Examples (good vs bad findings, plus a detour-sequencing example)
+
+The 5-axis rubric applied in Phase 3.5 CHALLENGE of `devlyn:ideate`. Both the solo Claude pass and the Codex pass (when `--with-codex` is set) use this file — there is exactly one definition of the rubric: the solo pass reads this file directly (as directed by SKILL.md), and the Codex pass receives its full text inlined into the prompt.
+
+The rubric exists because plans produced in a single pass, by a single model, in a single conversation almost always fail at least one axis somewhere. The user's historical experience: every time they asked "is this really no-workaround, no-guesswork, no-overengineering, world-class, optimized?", the honest answer was no. This phase makes the answer honestly yes before the user even has to ask.
+
+## Context — this is a PLANNING rubric, not a code rubric
+
+This rubric judges the shape of the roadmap: what items exist, in what order, and why. It does NOT judge implementation details, code style, or abstractions in code. "Overengineering" here means overengineering the plan, not overengineering a function. When applying it, keep asking: *is this the most direct, optimized path from the user's stated problem to a working outcome?*
+
+## The 5 axes
+
+### 1. NO WORKAROUND
+
+Does the item solve the actual problem directly, or does it route around a missing capability? If the direct path is "build X" and the item is "work around not having X", it fails.
+
+Canonical failure pattern: the user asks for a feature that papers over a missing foundation. Building the feature adds an item to the plan without solving the real problem, and often makes the real problem harder to fix later.
+
+### 2. NO GUESSWORK
+
+Every requirement must be grounded in something the user explicitly confirmed, or in something verifiable from the problem framing. Silent assumptions, "I think the user probably wants...", and requirements invented to fill gaps all fail.
+
+Canonical failure pattern: vague user input ("improve the dashboard") leads to a fully specified plan full of invented detail. Correct handling is to mark every assumed fact as [ASSUMED], ask clarifying questions, and keep the plan minimal until the user fills in the gaps.
+
+### 3. NO OVERENGINEERING (planning-stage)
+
+The plan fails this axis when it contains any of:
+
+- **Luxury items** — polish, theming, animations, nice-to-haves that do not serve the stated problem. A polish/theming item in Phase 1 of a tool that does not yet solve its core job.
+- **Filler items** — items added to pad a phase or make the plan feel complete. If an item has no testable requirement a real user would notice if absent, it is filler.
+- **Detour sequencing** — the plan takes the long way around when a direct route exists. Three items building toward X when one item could deliver X. Separate scaffold / store / deploy items when they could be bundled into the actual feature they enable.
+- **Roadmap workarounds masquerading as features** — see axis 1. The same failure can fire on axis 1 (papering over the real problem) and axis 3 (padding the roadmap with the workaround).
+
+The question to ask for every item: *"Is this the most direct, optimized path to the stated goal, or are we decorating, detouring, or papering over?"*
+
+### 4. WORLD-CLASS BEST PRACTICE
+
+Would a senior team at a top company structure the roadmap this way for this kind of product today? If a known-good pattern exists for sequencing or decomposing this kind of problem, name it and use it.
+
+Canonical failure pattern: the plan uses a familiar-but-mediocre decomposition when a better known-good pattern exists for the specific problem type. Example: using manual export/import for cross-device sync when autosave plus cloud draft storage is the standard pattern across mainstream editing tools (Notion, Linear, Gmail, Google Docs).
+
+### 5. OPTIMIZED
+
+Does the sequencing minimize wait time, front-load risk, and ship user-visible value at every phase boundary? Dead phases — phases that are pure setup with no visible win for a real user — are a fail.
+
+Canonical failure pattern: Phase 1 is entirely infrastructure (scaffold, models, deploy) and the first user-facing win arrives in Phase 2. Better: Phase 1 ships one thin vertical slice that a real user can use, even if it is small.
+
+## Hard rule — respect explicit user intent
+
+The rubric is a tool to prevent drift from quality, not a tool to override the user. If the user has explicitly and clearly stated a preference ("I want X, not Y"), the rubric does not silently replace X with Y. Instead:
+
+- Run the rubric as normal.
+- If an axis flags X, do not rewrite the plan. Record the finding and surface it to the user as an open question: "The rubric flags X on [axis] because [reason]. You explicitly asked for X — confirm you want to proceed, or consider [alternative]."
+- The user makes the call. The rubric's job is to make the tradeoff visible, not to make the decision.
+
+This rule exists because the 5-axis rubric is an opinionated lens, and opinionated lenses are sometimes wrong. The user's stated intent is ground truth when it is explicit. The rubric is ground truth only for things the user did not explicitly decide.
+
+## Finding format
+
+For every item that fails any axis, produce a finding in this exact format:
+
+```
+Severity: CRITICAL / HIGH / MEDIUM / LOW
+Quote: [copy the specific item title or line you are critiquing — one line]
+Axis: [which of the five]
+Why it fails: [one sentence]
+Fix: [one concrete revision — not "reconsider X"; say what to do instead]
+```
+
+For the plan as a whole, give a one-line pass/fail per axis with one-sentence reasoning.
+
+End with a verdict: `PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED`.
+
+The Quote field is load-bearing. It anchors each finding to a specific line in the plan, which prevents the common failure mode of generic, unanchored critiques ("too much in Phase 1", "consider refactoring"). Anchored findings are actionable; unanchored findings are noise.
+
+## Examples
+
+<example>
+BAD finding (too vague, not actionable):
+Severity: HIGH
+Axis: NO OVERENGINEERING
+Why: Phase 1 has too much.
+Fix: Reduce scope.
+
+GOOD finding (anchored, specific, actionable):
+Severity: HIGH
+Quote: "1.3 — Theme customization (light/dark/custom accent colors)"
+Axis: NO OVERENGINEERING (luxury item)
+Why it fails: The product does not yet solve its core job of letting users save a session; theming is a decoration item that does not move the primary problem forward.
+Fix: Move 1.3 to the backlog. Phase 1 is shorter by one item. Revisit theming only after the core save flow is shipped and used.
+</example>
+
+<example>
+BAD finding:
+Severity: HIGH
+Axis: NO WORKAROUND
+Why: Item 2.1 is a workaround.
+Fix: Do it properly.
+
+GOOD finding:
+Severity: CRITICAL
+Quote: "2.1 — Export/import session as JSON file so users can move work between devices"
+Axis: NO WORKAROUND
+Why it fails: The real problem is cross-device sync. File export is a roadmap workaround that asks the user to do the sync manually; it adds an item to the plan without solving the stated problem, and makes the real problem harder to fix later.
+Fix: Replace 2.1 with "Cloud-backed session storage" as a direct cross-device solution. If cloud storage is out of scope for the current phase, explicitly defer cross-device sync to a later phase rather than shipping a manual workaround as if it were the feature.
+</example>
+
+<example>
+Detour-sequencing finding:
+Severity: MEDIUM
+Quote: "Phase 1: 1.1-scaffold, 1.2-data-store, 1.3-log-today, 1.4-streak-display, 1.5-history-view, 1.6-manage-habits, 1.7-deploy"
+Axis: NO OVERENGINEERING (detour sequencing)
+Why it fails: Scaffold, data store, streak display, and deploy are not features a user would notice as separate items — they are implementation steps of the three actual user capabilities (log a habit, see streak, see history). Splitting them into standalone roadmap items pads the plan without delivering value at each boundary.
+Fix: Collapse Phase 1 to 2 items: "1.1 — Log a habit and see streak" (bundles scaffold + store + log + streak), "1.2 — History view". Deploy is part of each item's done criteria, not a standalone item. Result: 7 items → 2 items, same delivered scope.
+</example>

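The finding format translates directly into types. A TypeScript sketch; the severity and axis vocabularies come from the rubric above, while the verdict thresholds are one possible mapping, since the rubric names the three verdicts without pinning exact cutoffs:

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
type Axis =
  | "NO WORKAROUND"
  | "NO GUESSWORK"
  | "NO OVERENGINEERING"
  | "WORLD-CLASS BEST PRACTICE"
  | "OPTIMIZED";

interface Finding {
  severity: Severity;
  quote: string; // load-bearing: the exact plan line being critiqued
  axis: Axis;
  why: string; // one sentence
  fix: string; // one concrete revision
}

// Emit a finding in the rubric's exact five-field format.
function formatFinding(f: Finding): string {
  return [
    `Severity: ${f.severity}`,
    `Quote: ${f.quote}`,
    `Axis: ${f.axis}`,
    `Why it fails: ${f.why}`,
    `Fix: ${f.fix}`,
  ].join("\n");
}

// Assumed mapping: any CRITICAL or HIGH finding forces a revision.
function verdict(findings: Finding[]): string {
  if (findings.some((f) => f.severity === "CRITICAL" || f.severity === "HIGH"))
    return "FAIL — REVISION REQUIRED";
  return findings.length > 0 ? "PASS WITH MINOR FIXES" : "PASS";
}
```
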
package/config/skills/devlyn:ideate/references/codex-debate.md
ADDED
@@ -0,0 +1,109 @@
+# Codex Cross-Model Rubric Pass
+
+## Contents
+- Pre-flight check (verify Codex MCP server availability)
+- PHASE 3.5-CODEX: packaging the plan, calling Codex, reconciling findings with the solo pass
+- Cost notes (one Codex call per ideation session)
+
+Instructions for using OpenAI Codex as an independent critic during Phase 3.5 CHALLENGE. Only read this file when `--with-codex` is set. The 5-axis rubric itself lives in `challenge-rubric.md` — Claude loads that file directly from SKILL.md, not via this file.
+
+Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). The intent: one opinionated rubric pass from a different model family, applied right before the user sees the plan. Two model families catch different blind spots, and one pass at maximum effort catches more than multiple shallow passes.
+
+**Always use `reasoningEffort: "xhigh"` and `sandbox: "read-only"` for every Codex call in this file.** Maximum reasoning is the whole reason the `--with-codex` flag exists — lowering it defeats the purpose of bringing in a second model.
+
+---
+
+## PRE-FLIGHT CHECK
+
+Before starting the pipeline, verify the Codex MCP server is available by calling `mcp__codex-cli__ping`.
+
+- **If ping succeeds**: continue.
+- **If ping fails or `mcp__codex-cli__ping` is not found**: warn the user and ask:
+  ```
+  ⚠ Codex MCP server not detected. --with-codex requires codex-mcp-server.
+
+  To install:
+    npm i -g @openai/codex
+    claude mcp add codex-cli -- npx -y codex-mcp-server
+
+  Options:
+  [1] Continue without --with-codex (Claude-only solo CHALLENGE pass)
+  [2] Abort
+  ```
+  If [1], disable `--with-codex` and continue with the solo CHALLENGE. If [2], stop.
+
+---
+
+## PHASE 3.5-CODEX: Codex rubric pass
+
+Run after the solo CHALLENGE pass completes, before the user-facing summary.
+
+### Step 1 — Package the post-solo plan
+
+Use the plan as it stands after the solo rubric pass. Package the full context Codex needs:
+
+```
+## Problem framing (from FRAME phase)
+[problem statement, constraints, success criteria, anti-goals]
+
+## Confirmed facts vs assumptions
+Confirmed by user: [list]
+Assumptions (not yet confirmed): [list]
+
+## Plan (post-solo-CHALLENGE)
+Vision: [one sentence]
+Phase 1 ([theme]): [items, dependencies, one-line descriptions]
+Phase 2 ([theme]): ...
+Architecture decisions: [each with what / why / alternatives]
+Deferred to backlog: [items + reason]
+
+## Findings from the solo rubric pass
+[list each with: axis, quote, why, fix, whether applied]
+```
+
+Include the framing and assumptions — Codex can only judge whether the plan fits the user's reality if it sees what the user actually said.
+
+### Step 2 — Codex challenge pass
+
+Call `mcp__codex-cli__codex` with:
+- `prompt`: the packaged context above, followed by the instructions below
+- `workingDirectory`: the project root
+- `sandbox`: `"read-only"`
+- `reasoningEffort`: `"xhigh"` — the highest setting in the Codex enum (`none < minimal < low < medium < high < xhigh`). Always pick the top level; this is the entire reason for the flag.
+
+Instructions to append to the packaged context. **Before sending, inline the full text of `references/challenge-rubric.md` into the prompt under a `## Rubric` heading** — Codex does not have filesystem access to this project, so Claude must ship the rubric itself. Claude already has the rubric loaded from Phase 3.5 setup.
+
+Template for the appended instructions:
+
+```
+You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user has explicitly asked to be challenged because soft-pedaled plans waste their time.
+
+## Rubric
+[Claude inlines the full text of references/challenge-rubric.md here]
+
+## Your job
+- You are running AFTER a solo pass by Claude. Catch what the solo pass missed; do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" or "I would frame this differently" with a reason. Then add your own findings that the solo pass missed.
+- Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
+- Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let Claude surface it to the user.
+
+End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, and a one-line explanation.
+```
+
+### Step 3 — Reconcile solo and Codex findings
+
+Merge the two finding lists:
+- Same finding from both → keep the more specific wording, mark it "confirmed by both".
+- Codex-only → prefix `[codex]` in internal notes so the user-facing summary can show where each push came from.
+- Solo-only → keep as-is.
+- Conflicts (solo says X, Codex says not-X) → record both; do not silently pick one. If the conflict is material, include it as an open question in the user-facing summary.
+
+If Codex raised CRITICAL or HIGH findings that the solo pass missed, apply the fixes to the plan before presenting the user-facing summary. If fixing would change something the user explicitly asked for, follow the "Respect explicit user intent" rule already loaded from the rubric: do not silently rewrite — surface it.
+
+Do not loop. One Codex pass is enough. If the result is still FAIL after one pass, that is a signal that the plan has structural problems the user should see directly, not a signal to keep iterating in the background.
+
+---
+
+## Cost notes
+
+- One Codex call at `reasoningEffort: "xhigh"` typically takes 30–90s and is not cheap. This integration is bounded: exactly one Codex call per ideation session.
+- In Quick Add mode on a single new item, one Codex call is still worth it — small scope, huge signal, and single-item additions are exactly where workarounds slip in unnoticed.

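Step 3's merge rules are concrete enough to sketch. A sketch assuming findings travel as formatted text blocks in the rubric's five-field format, with string length as a crude proxy for "more specific wording" (both assumptions, not from the package):

```typescript
interface MergedFinding {
  text: string; // full Severity/Quote/Axis/Why/Fix block
  source: "solo" | "codex" | "both";
}

// Two findings are "the same" when they anchor to the same Quote line.
const quoteOf = (f: string): string | null => f.match(/^Quote: (.*)$/m)?.[1] ?? null;
const sameQuote = (a: string, b: string): boolean => {
  const qa = quoteOf(a);
  return qa !== null && qa === quoteOf(b);
};

function reconcile(solo: string[], codex: string[]): MergedFinding[] {
  const merged: MergedFinding[] = [];
  for (const s of solo) {
    const twin = codex.find((c) => sameQuote(s, c));
    merged.push(
      twin
        ? { text: `${twin.length > s.length ? twin : s}\n(confirmed by both)`, source: "both" }
        : { text: s, source: "solo" },
    );
  }
  for (const c of codex) {
    if (!solo.some((s) => sameQuote(s, c))) {
      // Codex-only: prefixed so the user-facing summary can attribute it.
      merged.push({ text: `[codex] ${c}`, source: "codex" });
    }
  }
  return merged;
}
```

Conflicting findings (solo says X, Codex says not-X) would need a third branch that records both sides, per the rules above.
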
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "devlyn-cli",
-  "version": "1.9.1",
+  "version": "1.11.0",
   "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
   "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
   "bin": {
@@ -9,6 +9,10 @@
   "files": [
     "bin",
     "config",
+    "!config/skills/preflight-workspace",
+    "!config/skills/preflight-workspace/**",
+    "!config/skills/devlyn:ideate-workspace",
+    "!config/skills/devlyn:ideate-workspace/**",
     "agents-config",
     "optional-skills",
     "CLAUDE.md"