@neikyun/ciel 6.10.1 → 6.11.1

Files changed (40)
  1. package/assets/.claude/hooks/memory-engine.py +256 -0
  2. package/assets/commands/ciel-audit.md +42 -0
  3. package/assets/commands/ciel-create-skill.md +2 -2
  4. package/assets/commands/ciel-status.md +1 -1
  5. package/assets/platforms/opencode/.opencode/agents/ciel-improver.md +2 -2
  6. package/assets/platforms/opencode/.opencode/commands/ciel-create-skill.md +2 -2
  7. package/assets/platforms/opencode/.opencode/commands/ciel-memory-bootstrap.md +195 -0
  8. package/assets/skills/ciel/SKILL.md +2 -1
  9. package/assets/skills/workflow/adr-auto/SKILL.md +88 -0
  10. package/assets/skills/workflow/ai-failure-modes-detector/SKILL.md +180 -0
  11. package/assets/skills/workflow/ask-window/SKILL.md +119 -0
  12. package/assets/skills/workflow/avec-quoi-versioner/SKILL.md +111 -0
  13. package/assets/skills/workflow/ci-watcher/SKILL.md +194 -0
  14. package/assets/skills/workflow/critiquer-auditor/SKILL.md +135 -0
  15. package/assets/skills/workflow/critiquer-auditor/reference.md +134 -0
  16. package/assets/skills/workflow/debug-reasoning-rca/SKILL.md +174 -0
  17. package/assets/skills/workflow/depth-classifier/SKILL.md +118 -0
  18. package/assets/skills/workflow/diverge/SKILL.md +91 -0
  19. package/assets/skills/workflow/doc-validator-official/SKILL.md +196 -0
  20. package/assets/skills/workflow/evaluer-sizer/SKILL.md +112 -0
  21. package/assets/skills/workflow/faire-gatekeeper/SKILL.md +99 -0
  22. package/assets/skills/workflow/flux-narrator/SKILL.md +93 -0
  23. package/assets/skills/workflow/memoire/SKILL.md +198 -0
  24. package/assets/skills/workflow/memoire-consolidator/SKILL.md +91 -0
  25. package/assets/skills/workflow/meta-critiquer/SKILL.md +112 -0
  26. package/assets/skills/workflow/modern-patterns-checker/SKILL.md +166 -0
  27. package/assets/skills/workflow/pattern-fitness-check/SKILL.md +108 -0
  28. package/assets/skills/workflow/playwright-visual-critic/SKILL.md +98 -0
  29. package/assets/skills/workflow/pr-review-responder/SKILL.md +214 -0
  30. package/assets/skills/workflow/prouver-verifier/SKILL.md +184 -0
  31. package/assets/skills/workflow/prouver-verifier/reference.md +152 -0
  32. package/assets/skills/workflow/quoi-framer/SKILL.md +91 -0
  33. package/assets/skills/workflow/relire-critic/SKILL.md +99 -0
  34. package/assets/skills/workflow/security-regression-check/SKILL.md +86 -0
  35. package/assets/skills/workflow/self-consistency-verifier/SKILL.md +85 -0
  36. package/assets/skills/workflow/spike-mode/SKILL.md +101 -0
  37. package/assets/skills/workflow/stride-analyzer/SKILL.md +96 -0
  38. package/assets/skills/workflow/stride-analyzer/reference.md +144 -0
  39. package/assets/skills/workflow/test-strategy-vitest-playwright/SKILL.md +119 -0
  40. package/package.json +1 -1
@@ -0,0 +1,88 @@
1
+ ---
2
+ name: adr-auto
3
+ description: How to document architectural decisions automatically in Ciel v5 (step 12). After FAIRE but before RELIRE, if the task involved a significant architectural decision, write an ADR (Architecture Decision Record) to docs/adrs/. Prevents knowledge loss.
4
+ ---
5
+
6
+ # Automatic ADR — Document Decisions in Real Time (Ciel v5)
7
+
8
+ ## What this covers
9
+
10
+ How to document architectural decisions during the Ciel v5 pipeline (step 12: ADR). After FAIRE but before RELIRE, if the task involved a significant architectural decision, write an ADR. The decision is documented while fresh, not months later.
11
+
12
+ ## Core principle
13
+
14
+ **If the decision was non-trivial, document WHY.** Code shows WHAT. ADRs show WHY. Without ADRs, future developers (or future you) will wonder why the code is the way it is.
15
+
16
+ ## When to write an ADR
17
+
18
+ Write an ADR when the task involves:
19
+ - Adding a new dependency/library
20
+ - Choosing between two technologies
21
+ - Changing a database schema
22
+ - Adopting a design pattern
23
+ - Making a performance trade-off
24
+ - Changing the build/deploy pipeline
25
+ - Any decision with long-term consequences
26
+
27
+ Do NOT write an ADR for:
28
+ - Bug fixes (tests document the fix)
29
+ - Refactoring without semantic change
30
+ - Renames/reorganizations
31
+ - Dependency upgrades (changelog suffices)
32
+
33
+ ## ADR format (based on Michael Nygard's template)
34
+
35
+ ```
36
+ # ADR-<NNN>: <Title>
37
+
38
+ ## Status
39
+
40
+ <proposed | accepted | deprecated | superseded by ADR-NNN>
41
+
42
+ ## Context
43
+
44
+ <What is the issue that we're seeing that is motivating this decision or change? 2-3 sentences.>
45
+
46
+ ## Decision
47
+
48
+ <What is the change that we're proposing and/or doing? 1-2 sentences.>
49
+
50
+ ## Consequences
51
+
52
+ <What becomes easier or harder to do because of this change? 2-3 items.>
53
+
54
+ ## References
55
+
56
+ <Link to relevant docs, tickets, or PRs>
57
+ ```
58
+
59
+ ## File naming
60
+
61
+ `docs/adrs/<NNN>-<kebab-case-title>.md`
62
+
63
+ Start at 001 and increment.
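The naming rule above can be sketched as a small helper. This is a minimal illustration, not part of the skill: `slugify` and `next_adr_path` are hypothetical names, and the sketch assumes existing ADRs already follow the `<NNN>-<title>.md` pattern.

```python
import re
from pathlib import Path


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens (kebab-case)."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def next_adr_path(adr_dir: Path, title: str) -> Path:
    """Return <adr_dir>/<NNN>-<kebab-case-title>.md using the next free number."""
    numbers = [
        int(match.group(1))
        for path in adr_dir.glob("*.md")
        if (match := re.match(r"(\d{3,})-", path.name))
    ]
    # Start at 001 when no ADR exists yet, otherwise increment the highest number.
    next_number = max(numbers, default=0) + 1
    return adr_dir / f"{next_number:03d}-{slugify(title)}.md"
```

Taking the highest existing number plus one (rather than counting files) keeps numbering monotonic even if an old ADR file was removed.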
64
+
65
+ ## How to trigger (Ciel v5)
66
+
67
+ In the Ciel pipeline (step 12), during ADR:
68
+ 1. Check if the task involved a significant decision (see list above)
69
+ 2. If yes -> write `docs/adrs/<NNN>-<title>.md`
70
+ 3. Update `.ciel/map.json` to reference the new ADR
71
+ 4. Reference the ADR in the RELIRE submission so the critic can check it
72
+
73
+ ## Common rationalizations
74
+
75
+ | Rationalization | Reality |
76
+ |---|---|
77
+ | "The code is self-documenting" | Code shows WHAT. ADRs show WHY. Six months from now, "why did we choose this" is not visible in the code. |
78
+ | "I'll add it later" | Later is when the decision is forgotten and the context is lost. Write it now or it never gets written. |
79
+ | "This decision is too small for an ADR" | If you had to think about it for more than 30 seconds, it's big enough for an ADR. |
80
+ | "Nobody reads ADRs anyway" | Nobody reads them until they need to undo a decision and can't figure out why it was made. Then they're invaluable. |
81
+
82
+ ## How to verify
83
+
84
+ - [ ] ADR written for every significant decision?
85
+ - [ ] No ADR written for trivial changes?
86
+ - [ ] ADR includes context, decision, consequences?
87
+ - [ ] Map updated with ADR reference?
88
+ - [ ] ADR committed with the code?
@@ -0,0 +1,180 @@
1
+ ---
2
+ name: ai-failure-modes-detector
3
+ description: Detects the six canonical failure modes of LLM-generated code — invented APIs, hallucinated dependencies, version drift, async/sync mismatch, confident-wrong logic, and extrinsic hallucination (plausible but unverifiable output). Runs self-consistency triple-generation checks, AST-based dependency audits, and uncertainty scoring. Triggers BEFORE merging agent-authored code, especially when the author is an LLM. Partners with doc-validator-official (API-level) and self-consistency-verifier (semantic-level).
4
+ allowed-tools: Read, Grep, Glob, Bash
5
+ ---
6
+
7
+ # ai-failure-modes-detector — Catch confident-wrong before it lands
8
+
9
+ LLM-generated code compiles more often than it's correct. Six failure modes account for >90% of post-merge incidents in agentic PRs (ISSTA 2025). This skill runs each check systematically.
10
+
11
+ ---
12
+
13
+ ## Inputs (infer before asking — see orchestrator's Autonomy protocol)
14
+
15
+ ```
16
+ CODE_UNDER_REVIEW: [file paths OR diff hunk]
17
+ AUTHOR: [human | LLM | mixed]
18
+ PROPOSED_DEPS: [new dependencies being added, if any]
19
+ TEST_COVERAGE: [files that have tests | files without]
20
+ ```
21
+
22
+ ### Auto-inference sources (exhaust BEFORE asking the user)
23
+
24
+ - **CODE_UNDER_REVIEW** → `git diff HEAD~1` (last commit) or `git diff main...HEAD` (branch diff) — usually the intent. If user said "this file", extract from prompt.
25
+ - **AUTHOR** → check the last commit's message / co-author trailer. `Co-Authored-By: Claude` or `Generated with Claude Code` → LLM. Otherwise human. If unsure, assume `mixed` (safer default).
26
+ - **PROPOSED_DEPS** → `git diff HEAD~1 -- package.json go.mod requirements.txt` → list added entries. Zero added → skip dep-hallucination check.
27
+ - **TEST_COVERAGE** → for each changed file in CODE_UNDER_REVIEW, check if a corresponding `*.test.*` / `*_test.go` / `test_*.py` exists next to it.
28
+
29
+ Never ask the user for AUTHOR — always inferable from git. Never ask for TEST_COVERAGE — always checkable via filesystem.
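As a rough sketch of the AUTHOR inference described above, the commit message can be scanned for the two markers the skill names. The function name `infer_author` and the choice to return `mixed` for an empty message are assumptions made for illustration.

```python
import re

# Markers taken from the text above; matching is case-insensitive.
LLM_MARKERS = (
    re.compile(r"^co-authored-by:\s*claude\b", re.IGNORECASE | re.MULTILINE),
    re.compile(r"generated with claude code", re.IGNORECASE),
)


def infer_author(commit_message: str) -> str:
    """Classify a commit message as 'LLM', 'human', or 'mixed' (when unsure)."""
    if not commit_message.strip():
        # Assumption: nothing to inspect, so fall back to the safer default.
        return "mixed"
    if any(marker.search(commit_message) for marker in LLM_MARKERS):
        return "LLM"
    return "human"
```

The message itself could come from `git log -1 --format=%B`; the sketch leaves that call out so the logic stays testable offline.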
30
+
31
+ ---
32
+
33
+ ## The six failure modes
34
+
35
+ ### 1. Invented APIs
36
+
37
+ Function/class/method that doesn't exist in the library at the pinned version.
38
+
39
+ **Detection**:
40
+ - Grep every import and every method call on imported symbols
41
+ - Cross-reference with `node_modules/<pkg>/package.json` + type definitions
42
+ - For dynamic imports (`await import()`), inspect at runtime if possible
43
+
44
+ **Signal**: import resolves but `<symbol>` not in the `.d.ts` or `__init__.py`.
45
+
46
+ ### 2. Hallucinated dependencies
47
+
48
+ An npm or pip package that doesn't exist on the registry (or is a typo-squat of a real one).
49
+
50
+ **Detection**:
51
+ - For each new dep in PROPOSED_DEPS: `npm view <pkg> --json` or `pip index versions <pkg>`
52
+ - Check publisher reputation (weekly downloads, last publish date, repo link present)
53
+ - Typo-squat check: Levenshtein distance ≤ 2 from a popular package name is SUSPICIOUS
54
+
55
+ **Signal**: registry returns 404, or package has < 100 downloads/week with no repo.
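The typo-squat check above can be illustrated with a plain Levenshtein distance. This is a sketch under stated assumptions: the `POPULAR` list is a hypothetical stand-in for a real list of popular package names, and `typosquat_candidates` is an illustrative helper, not an existing tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical sample; a real check would load a list of popular packages.
POPULAR = ("requests", "lodash", "express")


def typosquat_candidates(name: str, popular=POPULAR, max_distance: int = 2) -> list[str]:
    """Return popular packages within max_distance edits of name (exact matches excluded)."""
    return [p for p in popular if p != name and levenshtein(name, p) <= max_distance]
```

A non-empty result only marks the name as suspicious; download counts and repo history still need checking before flagging.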
56
+
57
+ ### 3. Version drift
58
+
59
+ Code uses an API that exists but at a different version than pinned.
60
+
61
+ **Detection**:
62
+ - For each external API call, check "Added in vX.Y" / "Deprecated in vX.Y" metadata
63
+ - Compare against pinned version in lockfile
64
+
65
+ **Signal**: API exists in v2, code pins v1 — silently broken.
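A minimal sketch of the comparison, assuming both the pinned version and the "Added in" version are plain dotted numbers (pre-release tags and ranges are out of scope here). The helper names are hypothetical.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '0.29.4' or 'v0.30' into a tuple of integers."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_version_drift(pinned: str, added_in: str) -> bool:
    """True when the pinned version is older than the version that introduced the API."""
    pinned_parts, added_parts = parse_version(pinned), parse_version(added_in)
    width = max(len(pinned_parts), len(added_parts))
    # Pad with zeros so '0.30' compares as '0.30.0'.
    pinned_parts += (0,) * (width - len(pinned_parts))
    added_parts += (0,) * (width - len(added_parts))
    return pinned_parts < added_parts
```

Comparing integer tuples avoids the string-comparison trap where `"1.10"` sorts before `"1.9"`.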
66
+
67
+ ### 4. Async/sync mismatch
68
+
69
+ Sync call in an async codebase or a Promise-returning function not awaited.
70
+
71
+ **Detection** (TS):
72
+ - `@typescript-eslint/no-floating-promises`
73
+ - Grep for unawaited `fetch(` calls, `fs.readFileSync` (sync call in async code), or unawaited `async` functions
74
+ - Any `Promise<T>` returned from a function whose callers don't `await`
75
+
76
+ **Detection** (Python):
77
+ - Sync `requests.get()` inside an `async def`
78
+ - `asyncio.run()` called inside an event loop
79
+
80
+ **Signal**: type checker emits "Promise returned but not awaited" OR sync call blocks in async context.
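For the Python case, one way to approximate the check is an `ast` walk that flags known blocking calls inside `async def` bodies. This is a sketch, not the skill's implementation: the `BLOCKING_CALLS` set is an assumed starting list, and calls made through aliases (`import requests as r`) are not caught.

```python
import ast

# Assumed starting list of blocking calls; extend per project.
BLOCKING_CALLS = {("requests", "get"), ("requests", "post")}


def find_blocking_calls(source: str) -> list[tuple[int, str]]:
    """Return (line, call) pairs for blocking calls found inside async functions."""
    findings = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.AsyncFunctionDef):
            continue
        for node in ast.walk(func):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and (node.func.value.id, node.func.attr) in BLOCKING_CALLS
            ):
                findings.append((node.lineno, f"{node.func.value.id}.{node.func.attr}"))
    return findings
```

Known limitation: a sync helper defined inside an `async def` is also scanned, so its calls are reported too and should be reviewed by hand.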
81
+
82
+ ### 5. Confident-wrong logic
83
+
84
+ Code is syntactically valid, type-checks, and passes linting, but is semantically wrong:
85
+ - Off-by-one on pagination
86
+ - Wrong operator (`>=` where `>` needed)
87
+ - Negated boolean
88
+ - Swapped arguments of same type
89
+
90
+ **Detection**:
91
+ - Run existing tests (if present) — a failing test is the first signal
92
+ - Invariant check: can you state in 1 sentence what the code guarantees? Does it actually guarantee it?
93
+ - For any numerical boundary, ask: "off-by-one in either direction — which breaks?"
94
+
95
+ **Signal**: behavior divergence between stated goal and actual execution.
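The invariant check can be made concrete with a pagination helper. The sketch assumes pages are 1-indexed, which is an assumption for illustration; the invariant is stated in the docstring and then asserted at both boundaries.

```python
def page_offset(page: int, page_size: int) -> int:
    """Invariant: page 1 starts at row 0, and page N starts right after page N-1 ends."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return (page - 1) * page_size


# Boundary checks: the first page must not skip rows, and pages must not overlap.
assert page_offset(1, 20) == 0
assert page_offset(2, 20) == page_offset(1, 20) + 20
```

Writing `page * page_size` here would type-check and lint cleanly while silently skipping the first 20 rows, which is the kind of finding this mode targets.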
96
+
97
+ ### 6. Extrinsic hallucination
98
+
99
+ Output is plausible but references facts outside the code that cannot be verified:
100
+ - Cites a spec section that doesn't exist
101
+ - Comments claim "per RFC 7231 §5.3" when section 5.3 doesn't cover that
102
+ - Error codes invented (`ERR_USER_QUOTA_EXCEEDED` — is that really thrown?)
103
+
104
+ **Detection**:
105
+ - Every code comment with a source claim → spot-check
106
+ - Every user-facing string (error codes, log messages) → grep for prior use in the codebase
107
+
108
+ **Signal**: claim cannot be corroborated.
109
+
110
+ ---
111
+
112
+ ## Report format
113
+
114
+ ```
115
+ ## AI-FAILURE-MODES VERDICT
116
+
117
+ ### Author
118
+ LLM (auto-detected via commit message pattern | user-declared)
119
+
120
+ ### Findings by mode
121
+ 1. Invented APIs:
122
+ [BLOCK] src/auth.ts:42 — `jwt.verifyStrict()` not in jsonwebtoken@9.0.2 (use `verify()` with `algorithms` option)
123
+
124
+ 2. Hallucinated deps:
125
+ (none — all 3 new deps exist on npm, >10k weekly downloads)
126
+
127
+ 3. Version drift:
128
+ [WARN] src/db.ts:18 — `drizzle.innerJoin()` added in v0.30, pinned 0.29 — upgrade drizzle-orm
129
+
130
+ 4. Async/sync mismatch:
131
+ [BLOCK] src/upload.ts:55 — `fs.writeFileSync()` inside async handler — blocks event loop
132
+
133
+ 5. Confident-wrong:
134
+ [WARN] src/pagination.ts:22 — `offset = page * pageSize` — skips the first page if `page` is 1-indexed (use `(page - 1) * pageSize`)
135
+
136
+ 6. Extrinsic:
137
+ [INFO] src/rate-limit.ts:10 — comment cites "per RFC 6585 §4" — corroborated: RFC 6585 §4 defines 429 Too Many Requests (citation is accurate, no action needed)
138
+
139
+ ### Summary
140
+ BLOCK: 2
141
+ WARN: 2
142
+ INFO: 1
143
+ ```
144
+
145
+ ---
146
+
147
+ ## Guardrails
148
+
149
+ - **BLOCK means don't merge** — invented APIs, hallucinated deps, and async/sync mismatches are production-breaking.
150
+ - **WARN means discuss in review** — not auto-blocking but requires human acknowledgment.
151
+ - **Run against diff, not whole repo** — old code isn't the subject; the new change is.
152
+ - **When tests are absent**, confidence in "confident-wrong" findings drops — request tests be added before clearing the review.
153
+ - **Don't false-positive on stubs** — intentional mocks in `__mocks__/` or `test-helpers/` may reference not-yet-implemented APIs; verify context.
154
+ - **Typo-squat false positives**: popular packages sometimes have close cousins (`request` vs `request-promise`) — check download count AND repo history before flagging.
155
+
156
+ ---
157
+
158
+ ## How to verify
159
+
160
+ - [ ] All 6 failure modes checked (invented APIs, hallucinated deps, version drift, async/sync, confident-wrong, extrinsic)?
161
+ - [ ] Each finding has evidence (file:line or URL)?
162
+ - [ ] VERDICT issued (CLEAN / FINDINGS)?
163
+ - [ ] Author identified (LLM vs human)?
164
+ - [ ] External API calls validated against official docs?
165
+
166
+ ## When triggered
167
+
168
+ - Post-write hook when AUTHOR=LLM and task is Standard/Critical
169
+ - Before any PR merge authored wholly or partially by an agent
170
+ - After `@ciel-explorer` completes CODEBASE review
171
+ - User command: "audit this code for AI mistakes"
172
+
173
+ ---
174
+
175
+ ## References
176
+
177
+ - ISSTA 2025 — "LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation"
178
+ - arxiv 2601.19106 — "Detecting and Correcting Hallucinations in LLM-Generated Code"
179
+ - arxiv 2404.00971 — "Beyond Functional Correctness"
180
+ - Anthropic 2604.08906 — agentic framework failure taxonomy
@@ -0,0 +1,119 @@
1
+ ---
2
+ name: ask-window
3
+ description: How to use the ASK window in Ciel v5 — before coding, clarify ambiguities using the question tool (OpenCode) or plan mode (Claude Code). Covers steps 3 (ASK) and 10 (ASK2) of the pipeline. Prevents coding on assumptions.
4
+ ---
5
+
6
+ # ASK Window — Clarify Before You Code (Ciel v5)
7
+
8
+ ## What this covers
9
+
10
+ How to use the ASK window in the Ciel v5 pipeline. Before coding, the agent must ask clarifying questions rather than assuming. This skill covers steps 3 (ASK after QUOI) and 10 (ASK2 after EVALUER).
11
+
12
+ ## Core principle
13
+
14
+ **Do not code on assumptions.** When requirements are ambiguous, parameters undefined, or choices implicit -> ask. Use the question tool (OpenCode) or plan mode (Claude Code).
15
+
16
+ ## Two modes — when to use which
17
+
18
+ ```
19
+ MODE ASK (step 3) → "What should I build?" → after QUOI, before research
20
+ MODE ASK2 (step 10) → "Should I build this way?" → after EVALUER, before coding
21
+ ```
22
+
23
+ ASK is about **requirements** — clarify what to build. ASK2 is about **the plan** — validate how to build it.
24
+
25
+ ### ASK (step 3) — clarify requirements
26
+
27
+ After QUOI, before any research or coding. Questions are about the **what**, not the **how**:
28
+
29
+ - Requirements: "Is the email field required?"
30
+ - Ambiguities: "Session cookie or JWT?"
31
+ - Assumptions: "I assume the database is PostgreSQL, correct?"
32
+ - Missing info: "What is the expected throughput?"
33
+ - Scope boundaries: "Does this include the admin panel?"
34
+
35
+ ### ASK2 (step 10) — validate the plan
36
+
37
+ After EVALUER, before FAIRE. Questions are about the **approach**, not the **requirements**:
38
+
39
+ - Approach validation: "I'm going with approach A because X. OK?"
40
+ - Trade-off validation: "Approach A is simpler but B is more flexible. OK with A?"
41
+ - Risk confirmation: "The main risk is X. Acceptable?"
42
+ - Effort check: "This will take ~2 hours. OK?"
43
+
44
+ ## How to ask
45
+
46
+ ### OpenCode: use the `question` tool
47
+
48
+ The `question` tool is a built-in OpenCode tool. Each question includes:
49
+ 1. A header (category of question)
50
+ 2. The question text
51
+ 3. A list of options (at least 2)
52
+ 4. A custom answer option
53
+
54
+ Example:
55
+ ```
56
+ Tool: question
57
+ Parameters:
58
+ header: "Database choice"
59
+ question: "Which database should we use for the new feature?"
60
+ options: ["PostgreSQL (existing)", "SQLite (simpler)", "MySQL (new)"]
61
+ ```
62
+
63
+ ### Claude Code: use plan mode
64
+
65
+ In Claude Code, switch to plan mode (Tab key), then:
66
+ 1. List your assumptions explicitly
67
+ 2. Ask questions one at a time
68
+ 3. Wait for answers before proceeding
69
+ 4. Update the plan based on responses
70
+
71
+ ## Structure of a good question
72
+
73
+ ```
74
+ QUESTION CATEGORY: <requirement | assumption | tradeoff | risk | scope>
75
+
76
+ Context: <1-2 sentences explaining why you're asking>
77
+
78
+ Question: <clear, specific question>
79
+
80
+ Options:
81
+ A: <option>
82
+ B: <option>
83
+ C: <other> (if relevant)
84
+ ```
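The question structure above could be modeled as a small validated record. This is purely illustrative: the `Question` class is hypothetical and is not the OpenCode `question` tool's API. It only enforces the two rules stated in this skill (a known category, at least two options).

```python
from dataclasses import dataclass, field

CATEGORIES = {"requirement", "assumption", "tradeoff", "risk", "scope"}


@dataclass
class Question:
    category: str
    context: str
    question: str
    options: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if len(self.options) < 2:
            raise ValueError("a question needs at least 2 options")

    def render(self) -> str:
        """Format the question using the layout shown above."""
        lines = [
            f"QUESTION CATEGORY: {self.category}",
            "",
            f"Context: {self.context}",
            "",
            f"Question: {self.question}",
            "",
            "Options:",
        ]
        lines += [f"{letter}: {option}" for letter, option in zip("ABCDEFGH", self.options)]
        return "\n".join(lines)
```

Rejecting single-option questions at construction time keeps vague prompts such as "what do you think?" from reaching the user.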
85
+
86
+ ## What NOT to ask
87
+
88
+ - Things you can discover yourself (read the code, check package.json)
89
+ - Trivial preferences that don't affect the design (naming, formatting)
90
+ - The same question twice (check your previous answers)
91
+ - Questions you already have the answer to (check .ciel/memory.json, overlay)
92
+
93
+ ## Output format
94
+
95
+ After the ASK phase, include in the plan:
96
+
97
+ ```
98
+ ## Questions asked (ASK)
99
+
100
+ 1. <question> -> <answer selected>
101
+ 2. <question> -> <custom answer>
102
+ ```
103
+
104
+ ## Common rationalizations
105
+
106
+ | Rationalization | Reality |
107
+ |---|---|
108
+ | "I'll just assume and fix it later" | Later is when it's in production and costs 10x to fix. Asking now costs 30 seconds. |
109
+ | "The user would have told me if it mattered" | Users don't know what they don't specify. Assumptions are silent bugs waiting to surface. |
110
+ | "I can figure it out from the code" | Code tells you WHAT, not WHY. If there are two valid approaches, code can't tell you which one the project prefers. |
111
+ | "Asking makes me look uncertain" | Coding on wrong assumptions makes you look incompetent. Asking is what senior engineers do. |
112
+
113
+ ## How to verify
114
+
115
+ - [ ] All unclear requirements have been asked about?
116
+ - [ ] Assumptions have been validated (not silently filled)?
117
+ - [ ] Trade-offs have been offered to the user?
118
+ - [ ] Questions are specific, not vague ("what do you think?")
119
+ - [ ] Answers are captured in the plan?
@@ -0,0 +1,111 @@
1
+ ---
2
+ name: avec-quoi-versioner
3
+ description: Reads actual installed library versions from package.json, build.gradle, go.mod, Cargo.toml, pyproject.toml, Gemfile.lock — never trusts memory or assumptions. Loads ciel-overlay.md if present for project-specific stack context. Invoked before research to ensure all subsequent docs lookups target the correct versions.
4
+ allowed-tools: Read, Grep, Glob, Bash
5
+ ---
6
+
7
+ # avec-quoi-versioner — Read real installed versions
8
+
9
+ Step 2 of CRÉER. The research quality is bounded by version accuracy. A skill that looks up "Ktor 2.x docs" when the project runs Ktor 3.x produces anti-patterns.
10
+
11
+ ---
12
+
13
+ ## Process
14
+
15
+ ### 1. Detect package manager(s)
16
+
17
+ Scan project root for the following files (in order):
18
+
19
+ | File | Stack |
20
+ |------|-------|
21
+ | `package.json` + `package-lock.json` | npm / Node.js |
22
+ | `package.json` + `yarn.lock` | yarn |
23
+ | `package.json` + `pnpm-lock.yaml` | pnpm |
24
+ | `package.json` + `bun.lockb` / `bun.lock` | bun |
25
+ | `build.gradle.kts` / `build.gradle` | JVM / Gradle |
26
+ | `pom.xml` | Maven |
27
+ | `go.mod` + `go.sum` | Go |
28
+ | `Cargo.toml` + `Cargo.lock` | Rust |
29
+ | `pyproject.toml` + `poetry.lock` / `uv.lock` | Python |
30
+ | `requirements.txt` | Python (pip) |
31
+ | `Gemfile` + `Gemfile.lock` | Ruby |
32
+ | `composer.json` | PHP |
33
+ | `Package.swift` / `Package.resolved` | Swift |
34
+
35
+ Multiple lockfiles may exist (monorepo). Read them all.
36
+
37
+ ### 2. Extract exact versions (not semver ranges)
38
+
39
+ For each relevant dependency in the task scope:
40
+
41
+ - Read the **lockfile** for the pinned version (not `package.json`'s range)
42
+ - For Gradle, run `./gradlew dependencies` if needed, or read `gradle.properties`
43
+ - For Go, `go.mod` already pins; verify with `go list -m all`
44
+ - For Maven, effective POM: `mvn help:effective-pom`
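For npm, reading the pinned version from the lockfile can be sketched as below. It assumes the standard `package-lock.json` layouts (a `packages` map keyed by `node_modules/<name>` in lockfile versions 2 and 3, a `dependencies` map in version 1); `pinned_version` is a hypothetical helper name.

```python
import json
from typing import Optional


def pinned_version(lockfile_text: str, package: str) -> Optional[str]:
    """Return the exact version pinned in a package-lock.json, or None if absent."""
    lock = json.loads(lockfile_text)
    # lockfileVersion 2/3: entries live under "packages", keyed by install path.
    entry = lock.get("packages", {}).get(f"node_modules/{package}")
    if entry is None:
        # lockfileVersion 1: entries live under "dependencies", keyed by name.
        entry = lock.get("dependencies", {}).get(package)
    return entry.get("version") if entry else None
```

A `None` result should be reported as "version unknown" rather than replaced by the manifest range.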
45
+
46
+ ### 3. Load ciel-overlay.md
47
+
48
+ If present at project root, extract:
49
+
50
+ - `## Stack` section — project's declared stack
51
+ - `## Versions` section — URLs to docs
52
+ - Any project-specific rules in `## Règles projet-spécifiques`
53
+
54
+ ### 4. State assumptions explicitly
55
+
56
+ For anything NOT verified from lockfile:
57
+
58
+ - "Assuming build tool X because [reason]."
59
+ - "Assuming PostgreSQL is running on default port because [reason]."
60
+
61
+ These assumptions must be flagged for `researcher` to verify.
62
+
63
+ ---
64
+
65
+ ## Output format
66
+
67
+ ```
68
+ ## AVEC QUOI
69
+
70
+ Stack detected:
71
+ - Frontend: <framework> <version> (from <file>)
72
+ - Backend: <framework> <version> (from <file>)
73
+ - Database: <type> <version> (from <file or overlay>)
74
+ - Test: <framework> <version> (from <file>)
75
+ - Build: <tool> <version>
76
+
77
+ Overlay:
78
+ - [Loaded: yes/no]
79
+ - [Relevant sections: Stack, Versions, Règles, Leçons]
80
+
81
+ Assumptions (NOT from lockfile):
82
+ - <assumption> — <reason>
83
+
84
+ Docs URLs (from overlay):
85
+ - <lib>: <url>
86
+ ```
87
+
88
+ ---
89
+
90
+ ## Guardrails
91
+
92
+ - **Never assume a version** — if lockfile is absent, state "version unknown" and flag it
93
+ - **Range vs pinned**: always report the pinned version from the lockfile, not the `^1.2.3` range from the manifest
94
+ - **Monorepo caution**: multiple lockfiles may diverge across packages. Specify which package the version applies to.
95
+ - **Don't guess URLs**: only report doc URLs from the overlay. Let `researcher` agent WebSearch for the rest.
96
+
97
+ ---
98
+
99
+ ## How to verify
100
+
101
+ - [ ] Versions read from lock files (not package.json ranges)?
102
+ - [ ] ciel-overlay.md consulted for project-specific versions?
103
+ - [ ] Framework detected (React/Vue/Svelte/Ktor/Express/etc)?
104
+ - [ ] Version gaps flagged (installed vs latest)?
105
+ - [ ] Overlay updated if new versions discovered?
106
+
107
+ ## When triggered
108
+
109
+ - Standard/Critical tasks, immediately after `quoi-framer`
110
+ - Before dispatching `researcher` agent (research quality depends on version accuracy)
111
+ - When user asks "what versions are we on?" or the task mentions a specific library
@@ -0,0 +1,194 @@
1
+ ---
2
+ name: ci-watcher
3
+ description: Streams GitHub Actions via `gh run watch`, classifies failures flaky (≥15% fail rate on main → auto-`gh run rerun --failed`) vs real (hand off to debug-reasoning-rca). Invoke after pr-opener, before pr-merger, or on "CI stuck" / "why is CI red" / "flaky test". Inline.
4
+ allowed-tools: Bash, Read
5
+ context: inline
6
+ ---
7
+
8
+ # ci-watcher — Watch CI, distinguish flaky from broken, retry smart
9
+
10
+ `prouver-verifier` takes a single-point snapshot of CI state. `ci-watcher` watches over time: streams the run, waits for completion, classifies failures as flaky vs real, retries only what's safe.
11
+
12
+ ---
13
+
14
+ ## Inputs
15
+
16
+ ```
17
+ BRANCH: [current branch — from git rev-parse]
18
+ PR_NUMBER: [optional — derived if branch has an open PR]
19
+ WORKFLOW: [optional — filter to specific workflow name; default all]
20
+ MODE: [watch | snapshot — default watch; snapshot = single poll + return]
21
+ MAX_RETRIES: [default 1 — for flaky-detected failures only]
22
+ FLAKY_THRESHOLD: [default 15 — % fail rate on main that classifies as flaky]
23
+ ```
24
+
25
+ ### Auto-inference sources
26
+
27
+ - **BRANCH** → `git rev-parse --abbrev-ref HEAD`
28
+ - **PR_NUMBER** → `gh pr view --json number --jq .number 2>/dev/null`
29
+ - **WORKFLOW** → all workflows that ran on the branch
30
+
31
+ ---
32
+
33
+ ## Preflight
34
+
35
+ ```bash
36
+ gh auth status 2>&1 | grep -q "Logged in" || exit 1
37
+ BRANCH=${BRANCH:-$(git rev-parse --abbrev-ref HEAD)}
38
+
39
+ # Confirm branch has at least one run
40
+ LATEST=$(gh run list --branch="$BRANCH" --limit=1 --json databaseId,status,conclusion --jq '.[0] // empty')
41
+ [ -z "$LATEST" ] && { echo "No runs found for branch $BRANCH — push first"; exit 1; }
42
+ ```
43
+
44
+ ---
45
+
46
+ ## Process
47
+
48
+ ### 1. Stream or snapshot
49
+
50
+ **Watch mode (default)** — stream until completion:
51
+
52
+ ```bash
53
+ RUN_ID=$(gh run list --branch="$BRANCH" --limit=1 --json databaseId --jq '.[0].databaseId')
54
+ gh run watch "$RUN_ID" --exit-status
55
+ RESULT=$?
56
+ ```
57
+
58
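+ `--exit-status` makes the command exit non-zero when the run fails. `gh run watch` polls the run and prints job and step progress until it completes; it does not print step logs (use `gh run view --log` or `--log-failed` for those).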
59
+
60
+ **Snapshot mode** — single poll:
61
+
62
+ ```bash
63
+ gh run list --branch="$BRANCH" --limit=5 --json databaseId,name,status,conclusion,url
64
+ ```
65
+
66
+ ### 2. On failure — classify flaky vs real
67
+
68
+ ```bash
69
+ # Get failing jobs for this run
70
+ FAILED_JOBS=$(gh run view "$RUN_ID" --json jobs --jq '.jobs[] | select(.conclusion == "failure") | .name')
71
+
72
+ # For each failed job, check history on the base branch
73
+ BASE=$(gh pr view "$PR_NUMBER" --json baseRefName --jq .baseRefName 2>/dev/null || echo "main")
74
+
75
+ for JOB in $FAILED_JOBS; do
76
+ # Last 50 runs on base branch for same workflow
77
+ WORKFLOW=$(gh run view "$RUN_ID" --json workflowName --jq .workflowName)
78
+ FAIL_RATE=$(gh run list \
79
+ --branch="$BASE" \
80
+ --workflow="$WORKFLOW" \
81
+ --limit=50 \
82
+ --json conclusion \
83
+ --jq '[.[] | select(.conclusion == "failure")] | length')
84
+
85
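+ TOTAL=$(gh run list --branch="$BASE" --workflow="$WORKFLOW" --limit=50 --json conclusion --jq 'length'); FAIL_PCT=$(( TOTAL > 0 ? FAIL_RATE * 100 / TOTAL : 0 ))  # divide by runs actually returned, not a fixed 50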
86
+
87
+ if [ "$FAIL_PCT" -ge "$FLAKY_THRESHOLD" ]; then
88
+ echo "Job '$JOB' fails ${FAIL_PCT}% of the time on $BASE — CLASSIFIED FLAKY"
89
+ FLAKY_JOBS+=("$JOB")
90
+ else
91
+ echo "Job '$JOB' fails ${FAIL_PCT}% of the time on $BASE — CLASSIFIED REAL FAILURE"
92
+ REAL_FAILURES+=("$JOB")
93
+ fi
94
+ done
95
+ ```
96
+
97
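+ **Flaky threshold rationale**: with integer percentages, 15% means at least 8 failures in 50 runs (7 failures is 14%). Below that, a single failure is likely the PR's fault. Above, it's environmental/test-harness instability. The rate is measured per workflow on the base branch, so it is a proxy for the failing job's own history.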
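The classification rule can be restated as a pure function, which makes the threshold arithmetic easy to test. This is a sketch: the function names are hypothetical, and it assumes runs still in progress report an empty conclusion and should be left out of the denominator.

```python
def fail_percentage(conclusions: list[str]) -> int:
    """Integer failure percentage over completed runs (0 when there is no history)."""
    completed = [c for c in conclusions if c]  # assumption: in-progress runs have ""
    if not completed:
        return 0
    failures = sum(1 for c in completed if c == "failure")
    return failures * 100 // len(completed)


def classify(conclusions: list[str], flaky_threshold: int = 15) -> str:
    """Return 'flaky' when the base-branch failure rate meets the threshold, else 'real'."""
    return "flaky" if fail_percentage(conclusions) >= flaky_threshold else "real"
```

Dividing by the number of completed runs, instead of a fixed 50, keeps young workflows with short histories from being under-counted.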
98
+
99
+ ### 3. Retry flaky jobs (up to MAX_RETRIES)
100
+
101
+ ```bash
102
+ if [ ${#FLAKY_JOBS[@]} -gt 0 ] && [ "${RETRY_COUNT:-0}" -lt "${MAX_RETRIES:-1}" ]; then
103
+ echo "Retrying flaky jobs (attempt $((RETRY_COUNT+1))/$MAX_RETRIES)"
104
+ gh run rerun "$RUN_ID" --failed
105
+ RETRY_COUNT=$((RETRY_COUNT+1))
106
+ # Re-enter step 1 (watch the rerun)
107
+ fi
108
+ ```
109
+
110
+ `--failed` only reruns failed jobs (saves CI minutes).
111
+
112
+ ### 4. Extract log excerpt for real failures
113
+
114
+ For handoff to `debug-reasoning-rca`:
115
+
116
+ ```bash
117
+ for JOB in "${REAL_FAILURES[@]}"; do
118
+ JOB_ID=$(gh run view "$RUN_ID" --json jobs --jq ".jobs[] | select(.name == \"$JOB\") | .databaseId")
119
+
120
+ # Last 50 lines of the failing step
121
+ gh run view --job="$JOB_ID" --log-failed | tail -50
122
+ done
123
+ ```
124
+
125
+ ### 5. Emit output
126
+
127
+ ```
128
+ [CI WATCHER]
129
+ Run: <URL>
130
+ Status: <success | failure | in_progress>
131
+ Duration: <Xm Ys>
132
+
133
+ Jobs:
134
+ [OK] build
135
+ [OK] lint
136
+ [WARN] integration-tests — FLAKY (fails 18% on main, retried — now green)
137
+ [FAIL] unit-tests — REAL FAILURE (fails 2% on main — investigate)
138
+
139
+ Handoff (if real failures):
140
+ - debug-reasoning-rca with SYMPTOM=<failing test name> + LOG excerpt
141
+ ```
142
+
143
+ ---
144
+
145
+ ## Guardrails
146
+
147
+ - **MAX_RETRIES=1 by default** — a flaky test that fails twice in a row is likely not flaky. Don't spam retries.
148
+ - **Never retry real failures** — the retry mechanism is ONLY for jobs classified flaky. Real failures need a code fix.
149
+ - **Never retry pre-merge checks on main** — only PR branches. Retrying on main risks hiding real regressions.
150
+ - **Budget-aware**: large rerun loops burn CI minutes. Log estimated minutes cost before retry on repos with tight budgets.
151
+ - **Respect timeouts**: `gh run watch` can hang if a job hangs. Wrap in `timeout 1800 gh run watch` for 30-min ceiling.
152
+ - **Flaky classification is per-job, not per-run**: if 3 of 5 jobs are flaky but 1 is real, DO NOT retry — fix the real one first.
153
+ - **Store flaky detections** — append to `.claude/flaky-tests.log` (optional, per-project) so patterns surface across sessions.
154
+
155
+ ---
156
+
157
+ ## When triggered
158
+
159
+ - After `pr-opener` in Standard pipeline (step 11 post-insertion)
160
+ - Before `pr-merger` as CI-green verification (replaces inline `gh run list` check)
161
+ - User says: "watch CI", "is CI green?", "CI is flaky", "rerun failed jobs"
162
+ - `prouver-verifier` detects a red CI and needs disambiguation
163
+
164
+ ---
165
+
166
+ ## Anti-pattern
167
+
168
+ ```
169
+ ❌ Failed → rerun blindly → rerun → rerun → real bug hidden, minutes wasted
170
+ ✅ Failed → classify (fail % on main) → retry only flaky → real fail → debug-reasoning-rca
171
+ ```
172
+
173
+ ```
174
+ ❌ sleep 300 && gh run list # blocked by harness; also cache-cold
175
+ ✅ gh run watch --exit-status # streams, no sleep
176
+ ```
177
+
178
+ ---
179
+
180
+ ## Handoff
181
+
182
+ - **If all green** → `pr-merger` can proceed
183
+ - **If real failure** → `debug-reasoning-rca` via `@ciel-critic` with log excerpt as SYMPTOM + failing job as SCOPE
184
+ - **If flaky detected + retry succeeded** → proceed to `pr-merger`, log flaky for future `/ciel-improve` signal
185
+ - **If flaky + retry failed** → escalate to user (flaky-turned-real or real-misclassified)
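The handoff rules above reduce to a short decision function. This is a sketch with hypothetical names. It also folds in the guardrail that real failures take priority over flaky retries.

```python
def next_step(real_failures: list[str], flaky_jobs: list[str], retry_succeeded: bool = False) -> str:
    """Map the CI outcome to the next pipeline step."""
    if real_failures:
        # Real failures always win: fix them before retrying anything flaky.
        return "debug-reasoning-rca"
    if not flaky_jobs:
        return "pr-merger"  # all green
    # Only flaky jobs failed: proceed if the retry went green, otherwise escalate.
    return "pr-merger" if retry_succeeded else "escalate-to-user"
```

For example, `next_step([], ["integration-tests"], retry_succeeded=True)` returns `"pr-merger"`.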
186
+
187
+ ---
188
+
189
+ ## References
190
+
191
+ - `gh run watch` — cli.github.com/manual/gh_run_watch
192
+ - `gh run rerun --failed` — cli.github.com/manual/gh_run_rerun
193
+ - Flaky test classification — Memon et al., "Taming Google-Scale Continuous Testing" (ICSE-SEIP 2017); background for the 15% threshold baseline
194
+ - Ciel pipeline: pr-opener → ci-watcher → (flaky? retry : debug-reasoning-rca) → pr-merger