@neikyun/ciel 6.10.1 → 6.11.1

Files changed (40)

package/assets/skills/workflow/adr-auto/SKILL.md ADDED

@@ -0,0 +1,88 @@
+---
+name: adr-auto
+description: How to document architectural decisions automatically in Ciel v5 (step 12). After FAIRE but before RELIRE, if the task involved a significant architectural decision, write an ADR (Architecture Decision Record) to docs/adrs/. Prevents knowledge loss.
+---
+# Automatic ADR — Document Decisions in Real Time (Ciel v5)
+## What this covers
+How to document architectural decisions during the Ciel v5 pipeline (step 12: ADR). After FAIRE but before RELIRE, if the task involved a significant architectural decision, write an ADR. The decision is documented while fresh, not months later.
+## Core principle
+**If the decision was non-trivial, document WHY.** Code shows WHAT. ADRs show WHY. Without ADRs, future developers (or future you) will wonder why the code is the way it is.
+## When to write an ADR
+Write an ADR when the task involves:
+- Adding a new dependency/library
+- Choosing between two technologies
+- Changing a database schema
+- Adopting a design pattern
+- Making a performance trade-off
+- Changing the build/deploy pipeline
+- Any decision with long-term consequences
+Do NOT write an ADR for:
+- Bug fixes (tests document the fix)
+- Refactoring without semantic change
+- Renames/reorganizations
+- Dependency upgrades (changelog suffices)
+## ADR format (based on Michael Nygard's template)
+```
+# ADR-<NNN>: <Title>
+## Status
+<proposed | accepted | deprecated | superseded by ADR-NNN>
+## Context
+<What is the issue that we're seeing that is motivating this decision or change? 2-3 sentences.>
+## Decision
+<What is the change that we're proposing and/or doing? 1-2 sentences.>
+## Consequences
+<What becomes easier or harder to do because of this change? 2-3 items.>
+## References
+<Link to relevant docs, tickets, or PRs>
+```
+## File naming
+`docs/adrs/<NNN>-<kebab-case-title>.md`
+Start at 001 and increment.
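
The naming rule above can be sketched as a small helper. This is a minimal illustration, not part of Ciel: the function name `next_adr_filename` and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) are assumptions made for the example.

```python
import re


def next_adr_filename(title: str, existing: list[str]) -> str:
    """Return '<NNN>-<kebab-case-title>.md' using the next free number.

    `existing` is the list of file names already in docs/adrs/
    (hypothetical input; the caller decides how to list the directory).
    """
    numbers = []
    for name in existing:
        match = re.match(r"(\d+)-", name)
        if match:
            numbers.append(int(match.group(1)))
    # Start at 001 when no numbered ADR exists yet, otherwise increment.
    next_number = max(numbers, default=0) + 1
    # Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{next_number:03d}-{slug}.md"
```

A caller could feed it the current directory listing, for example `[p.name for p in Path("docs/adrs").glob("*.md")]`.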
+## How to trigger (Ciel v5)
+In the Ciel pipeline (step 12), during ADR:
+1. Check if the task involved a significant decision (see list above)
+2. If yes -> write `docs/adrs/<NNN>-<title>.md`
+3. Update `.ciel/map.json` to reference the new ADR
+4. Reference the ADR in the RELIRE submission so the critic can check it
+## Common rationalizations
+| Rationalization | Reality |
+|---|---|
+| "The code is self-documenting" | Code shows WHAT. ADRs show WHY. Six months from now, "why did we choose this" is not visible in the code. |
+| "I'll add it later" | Later is when the decision is forgotten and the context is lost. Write it now or it never gets written. |
+| "This decision is too small for an ADR" | If you had to think about it for more than 30 seconds, it's big enough for an ADR. |
+| "Nobody reads ADRs anyway" | Nobody reads them until they need to undo a decision and can't figure out why it was made. Then they're invaluable. |
+## How to verify
+- [ ] ADR written for every significant decision?
+- [ ] No ADR written for trivial changes?
+- [ ] ADR includes context, decision, consequences?
+- [ ] Map updated with ADR reference?
+- [ ] ADR committed with the code?

package/assets/skills/workflow/ai-failure-modes-detector/SKILL.md ADDED

@@ -0,0 +1,180 @@
+---
+name: ai-failure-modes-detector
+description: Detects the six canonical failure modes of LLM-generated code — invented APIs, hallucinated dependencies, version drift, async/sync mismatch, confident-wrong logic, and extrinsic hallucination (plausible but unverifiable output). Runs self-consistency triple-generation checks, AST-based dependency audits, and uncertainty scoring. Triggers BEFORE merging agent-authored code, especially when the author is an LLM. Partners with doc-validator-official (API-level) and self-consistency-verifier (semantic-level).
+allowed-tools: Read, Grep, Glob, Bash
+---
+# ai-failure-modes-detector — Catch confident-wrong before it lands
+LLM-generated code compiles more often than it's correct. Six failure modes account for >90% of post-merge incidents in agentic PRs (ISSTA 2025). This skill runs each check systematically.
+---
+## Inputs (infer before asking — see orchestrator's Autonomy protocol)
+```
+CODE_UNDER_REVIEW: [file paths OR diff hunk]
+AUTHOR: [human | LLM | mixed]
+PROPOSED_DEPS: [new dependencies being added, if any]
+TEST_COVERAGE: [files that have tests | files without]
+```
+### Auto-inference sources (exhaust BEFORE asking the user)
+- **CODE_UNDER_REVIEW** → `git diff HEAD~1` (last commit) or `git diff main...HEAD` (branch diff) — usually the intent. If user said "this file", extract from prompt.
+- **AUTHOR** → check the last commit's message / co-author trailer. `Co-Authored-By: Claude` or `Generated with Claude Code` → LLM. Otherwise human. If unsure, assume `mixed` (safer default).
+- **PROPOSED_DEPS** → `git diff HEAD~1 -- package.json go.mod requirements.txt` → list added entries. Zero added → skip dep-hallucination check.
+- **TEST_COVERAGE** → for each changed file in CODE_UNDER_REVIEW, check if a corresponding `*.test.*` / `*_test.go` / `test_*.py` exists next to it.
+Never ask the user for AUTHOR — always inferable from git. Never ask for TEST_COVERAGE — always checkable via filesystem.
+---
+## The six failure modes
+### 1. Invented APIs
+Function/class/method that doesn't exist in the library at the pinned version.
+**Detection**:
+- Grep every import and every method call on imported symbols
+- Cross-reference with `node_modules/<pkg>/package.json` + type definitions
+- For dynamic imports (`await import()`), inspect at runtime if possible
+**Signal**: import resolves but `<symbol>` not in the `.d.ts` or `__init__.py`.
+### 2. Hallucinated dependencies
+An npm or pip package that does not exist on the registry, or one that typo-squats a popular package.
+**Detection**:
+- For each new dep in PROPOSED_DEPS: `npm view <pkg> --json` or `pip index versions <pkg>`
+- Check publisher reputation (weekly downloads, last publish date, repo link present)
+- Typo-squat check: Levenshtein distance ≤ 2 from a popular package name is SUSPICIOUS
+**Signal**: registry returns 404, or package has < 100 downloads/week with no repo.
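
The typo-squat rule (Levenshtein distance ≤ 2 from a popular name) can be sketched as follows. This is an illustrative sketch only: the function names and the idea of passing in a hand-picked `popular` list are assumptions, and plain Levenshtein counts a transposed pair of letters as two edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def typosquat_candidates(name: str, popular: list[str], max_distance: int = 2) -> list[str]:
    """Popular packages that `name` is suspiciously close to (but not equal to)."""
    return [p for p in popular if p != name and levenshtein(name, p) <= max_distance]
```

A non-empty result is only a prompt for the download-count and repo-history checks described above, not a verdict on its own.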
+### 3. Version drift
+Code uses an API that exists but at a different version than pinned.
+**Detection**:
+- For each external API call, check "Added in vX.Y" / "Deprecated in vX.Y" metadata
+- Compare against pinned version in lockfile
+**Signal**: API exists in v2, code pins v1 — silently broken.
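
Comparing a pinned version against an "Added in vX.Y" note might look like the sketch below. It assumes plain dotted numeric versions (no pre-release tags such as `-beta`), and both function names are hypothetical.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '0.29.4' or 'v0.30' into a tuple of ints. Assumes numeric parts only."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def api_available(pinned: str, added_in: str) -> bool:
    """True if the pinned version is at least the version that introduced the API."""
    p, a = parse_version(pinned), parse_version(added_in)
    width = max(len(p), len(a))
    # Pad with zeros so that '0.30' and '0.30.0' compare as equal.
    p += (0,) * (width - len(p))
    a += (0,) * (width - len(a))
    return p >= a
```

The padding step matters because Python orders a shorter tuple before a longer one with the same prefix, which would otherwise treat `0.30` as older than `0.30.0`.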
+### 4. Async/sync mismatch
+Sync call in an async codebase or a Promise-returning function not awaited.
+**Detection** (TS):
+- `@typescript-eslint/no-floating-promises`
+- Grep for `fetch(`, `fs.readFileSync` (sync in async) or unawaited `async` functions
+- Any `Promise<T>` returned from a function whose callers don't `await`
+**Detection** (Python):
+- Sync `requests.get()` inside an `async def`
+- `asyncio.run()` called inside an event loop
+**Signal**: type checker emits "Promise returned but not awaited" OR sync call blocks in async context.
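
For the Python case, one possible AST-based check is sketched here. The `BLOCKING_CALLS` set is an assumed, deliberately tiny allowlist for illustration, and the walk also descends into sync helpers nested inside an `async def`, so it can over-report.

```python
import ast

# Assumed examples of blocking calls; a real list would be project-specific.
BLOCKING_CALLS = {("requests", "get"), ("requests", "post"), ("time", "sleep")}


def find_blocking_calls(source: str) -> list[tuple[int, str]]:
    """Return (line, call) pairs for known blocking calls inside `async def` bodies."""
    findings = set()
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.AsyncFunctionDef):
            continue
        for node in ast.walk(func):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and (node.func.value.id, node.func.attr) in BLOCKING_CALLS
            ):
                findings.add((node.lineno, f"{node.func.value.id}.{node.func.attr}"))
    # A set avoids double-counting calls inside nested async functions.
    return sorted(findings)
```

It only matches the `module.function(...)` call shape, so aliased imports (`import requests as r`) would need extra handling.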
+### 5. Confident-wrong logic
+Code is syntactically valid, type-checks, and passes linting, but is semantically wrong:
+- Off-by-one on pagination
+- Wrong operator (`>=` where `>` needed)
+- Negated boolean
+- Swapped arguments of same type
+**Detection**:
+- Run existing tests (if present): a failing test is the first signal
+- Invariant check: can you state in 1 sentence what the code guarantees? Does it actually guarantee it?
+- For any numerical boundary, ask: "off-by-one in either direction — which breaks?"
+**Signal**: behavior divergence between stated goal and actual execution.
+### 6. Extrinsic hallucination
+Output is plausible but references facts outside the code that cannot be verified:
+- Cites a spec section that doesn't exist
+- Comments claim "per RFC 7231 §5.3" when section 5.3 doesn't cover that
+- Error codes invented (`ERR_USER_QUOTA_EXCEEDED` — is that really thrown?)
+**Detection**:
+- Every code comment with a source claim → spot-check
+- Every user-facing string (error codes, log messages) → grep for prior use in the codebase
+**Signal**: claim cannot be corroborated.
+---
+## Report format
+```
+## AI-FAILURE-MODES VERDICT
+### Author
+LLM  (auto-detected via commit message pattern | user-declared)
+### Findings by mode
+1. Invented APIs:
+   [BLOCK] src/auth.ts:42 — `jwt.verifyStrict()` not in jsonwebtoken@9.0.2 (use `verify()` with `algorithms` option)
+2. Hallucinated deps:
+   (none — all 3 new deps exist on npm, >10k weekly downloads)
+3. Version drift:
+   [WARN] src/db.ts:18 — `drizzle.innerJoin()` added in v0.30, pinned 0.29 — upgrade drizzle-orm
+4. Async/sync mismatch:
+   [BLOCK] src/upload.ts:55 — `fs.writeFileSync()` inside async handler — blocks event loop
+5. Confident-wrong:
+   [WARN] src/pagination.ts:22 — `offset = page * pageSize` — off-by-one if `page` is 1-indexed (page=1 skips the first `pageSize` rows)
+6. Extrinsic:
+   [INFO] src/rate-limit.ts:10 — comment cites "per RFC 6585 §4" — verified: RFC 6585 §4 defines 429 Too Many Requests (citation is correct, no action needed)
+### Summary
+BLOCK: 2
+WARN:  2
+INFO:  1
+```
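
The Summary counts in the report above could be derived from the finding lines rather than tallied by hand. A minimal sketch, assuming every finding line starts with a `[BLOCK]`, `[WARN]`, or `[INFO]` tag as in the sample report; `summarize` and `verdict` are hypothetical names.

```python
import re
from collections import Counter

# Matches a severity tag at the start of a (possibly indented) line.
SEVERITY = re.compile(r"^[ \t]*\[(BLOCK|WARN|INFO)\]", re.MULTILINE)


def summarize(report: str) -> dict[str, int]:
    """Count findings per severity level in a report body."""
    counts = Counter(SEVERITY.findall(report))
    return {level: counts.get(level, 0) for level in ("BLOCK", "WARN", "INFO")}


def verdict(report: str) -> str:
    """CLEAN when no tagged finding is present, FINDINGS otherwise."""
    return "FINDINGS" if any(summarize(report).values()) else "CLEAN"
```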
+---
+## Guardrails
+- **BLOCK means don't merge** — invented APIs, hallucinated deps, and async/sync mismatches are production-breaking.
+- **WARN means discuss in review** — not auto-blocking but requires human acknowledgment.
+- **Run against diff, not whole repo** — old code isn't the subject; the new change is.
+- **When tests are absent**, confidence in "confident-wrong" findings drops — request tests be added before clearing the review.
+- **Don't false-positive on stubs** — intentional mocks in `__mocks__/` or `test-helpers/` may reference not-yet-implemented APIs; verify context.
+- **Typo-squat false positives**: popular packages sometimes have close cousins (`request` vs `request-promise`) — check download count AND repo history before flagging.
+---
+## How to verify
+- [ ] All 6 failure modes checked (invented APIs, hallucinated deps, version drift, async/sync, confident-wrong, extrinsic)?
+- [ ] Each finding has evidence (file:line or URL)?
+- [ ] VERDICT issued (CLEAN / FINDINGS)?
+- [ ] Author identified (LLM vs human)?
+- [ ] External API calls validated against official docs?
+## When triggered
+- Post-write hook when AUTHOR=LLM and task is Standard/Critical
+- Before any PR merge authored wholly or partially by an agent
+- After `@ciel-explorer` completes CODEBASE review
+- User command: "audit this code for AI mistakes"
+---
+## References
+- ISSTA 2025 — "LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation"
+- arxiv 2601.19106 — "Detecting and Correcting Hallucinations in LLM-Generated Code"
+- arxiv 2404.00971 — "Beyond Functional Correctness"
+- Anthropic 2604.08906 — agentic framework failure taxonomy

package/assets/skills/workflow/ask-window/SKILL.md ADDED

@@ -0,0 +1,119 @@
+---
+name: ask-window
+description: How to use the ASK window in Ciel v5. Before coding, clarify ambiguities using the question tool (OpenCode) or plan mode (Claude Code). Covers steps 3 (ASK) and 10 (ASK2) of the pipeline. Prevents coding on assumptions.
+---
+# ASK Window — Clarify Before You Code (Ciel v5)
+## What this covers
+How to use the ASK window in the Ciel v5 pipeline. Before coding, the agent must ask clarifying questions rather than assuming. This skill covers steps 3 (ASK after QUOI) and 10 (ASK2 after EVALUER).
+## Core principle
+**Do not code on assumptions.** When requirements are ambiguous, parameters undefined, or choices implicit -> ask. Use the question tool (OpenCode) or plan mode (Claude Code).
+## Two modes — when to use which
+```
+MODE ASK  (step 3)   → "What should I build?"     → after QUOI, before research
+MODE ASK2 (step 10)  → "Should I build this way?"  → after EVALUER, before coding
+```
+ASK is about **requirements** — clarify what to build. ASK2 is about **the plan** — validate how to build it.
+### ASK (step 3) — clarify requirements
+After QUOI, before any research or coding. Questions are about the **what**, not the **how**:
+- Requirements: "Is the email field required?"
+- Ambiguities: "Session cookie or JWT?"
+- Assumptions: "I assume the database is PostgreSQL, correct?"
+- Missing info: "What is the expected throughput?"
+- Scope boundaries: "Does this include the admin panel?"
+### ASK2 (step 10) — validate the plan
+After EVALUER, before FAIRE. Questions are about the **approach**, not the **requirements**:
+- Approach validation: "I'm going with approach A because X. OK?"
+- Trade-off validation: "Approach A is simpler but B is more flexible. OK with A?"
+- Risk confirmation: "The main risk is X. Acceptable?"
+- Effort check: "This will take ~2 hours. OK?"
+## How to ask
+### OpenCode: use the `question` tool
+The `question` tool is a built-in OpenCode tool. Each question includes:
+1. A header (category of question)
+2. The question text
+3. A list of options (at least 2)
+4. A custom answer option
+Example:
+```
+Tool: question
+Parameters:
+  header: "Database choice"
+  question: "Which database should we use for the new feature?"
+  options: ["PostgreSQL (existing)", "SQLite (simpler)", "MySQL (new)"]
+```
+### Claude Code: use plan mode
+In Claude Code, switch to plan mode (Shift+Tab cycles through modes), then:
+1. List your assumptions explicitly
+2. Ask questions one at a time
+3. Wait for answers before proceeding
+4. Update the plan based on responses
+## Structure of a good question
+```
+QUESTION CATEGORY: <requirement | assumption | tradeoff | risk | scope>
+Context: <1-2 sentences explaining why you're asking>
+Question: <clear, specific question>
+Options:
+  A: <option>
+  B: <option>
+  C: <other> (if relevant)
+```
+## What NOT to ask
+- Things you can discover yourself (read the code, check package.json)
+- Trivial preferences that don't affect the design (naming, formatting)
+- The same question twice (check your previous answers)
+- Questions you already have the answer to (check .ciel/memory.json, overlay)
+## Output format
+After the ASK phase, include in the plan:
+```
+## Questions asked (ASK)
+1. <question> -> <answer selected>
+2. <question> -> <custom answer>
+```
+## Common rationalizations
+| Rationalization | Reality |
+|---|---|
+| "I'll just assume and fix it later" | Later is when it's in production and costs 10x to fix. Asking now costs 30 seconds. |
+| "The user would have told me if it mattered" | Users don't know what they don't specify. Assumptions are silent bugs waiting to surface. |
+| "I can figure it out from the code" | Code tells you WHAT, not WHY. If there are two valid approaches, code can't tell you which one the project prefers. |
+| "Asking makes me look uncertain" | Coding on wrong assumptions makes you look incompetent. Asking is what senior engineers do. |
+## How to verify
+- [ ] All unclear requirements have been asked about?
+- [ ] Assumptions have been validated (not silently filled)?
+- [ ] Trade-offs have been offered to the user?
+- [ ] Questions are specific, not vague ("what do you think?")
+- [ ] Answers are captured in the plan?

package/assets/skills/workflow/avec-quoi-versioner/SKILL.md ADDED

@@ -0,0 +1,111 @@
+---
+name: avec-quoi-versioner
+description: Reads actual installed library versions from package.json, build.gradle, go.mod, Cargo.toml, pyproject.toml, Gemfile.lock — never trusts memory or assumptions. Loads ciel-overlay.md if present for project-specific stack context. Invoked before research to ensure all subsequent docs lookups target the correct versions.
+allowed-tools: Read, Grep, Glob, Bash
+---
+# avec-quoi-versioner — Read real installed versions
+Step 2 of CRÉER. The research quality is bounded by version accuracy. A skill that looks up "Ktor 2.x docs" when the project runs Ktor 3.x produces anti-patterns.
+---
+## Process
+### 1. Detect package manager(s)
+Scan project root for the following files (in order):
+| File | Stack |
+|------|-------|
+| `package.json` + `package-lock.json` | npm / Node.js |
+| `package.json` + `yarn.lock` | yarn |
+| `package.json` + `pnpm-lock.yaml` | pnpm |
+| `package.json` + `bun.lockb` | bun |
+| `build.gradle.kts` / `build.gradle` | JVM / Gradle |
+| `pom.xml` | Maven |
+| `go.mod` + `go.sum` | Go |
+| `Cargo.toml` + `Cargo.lock` | Rust |
+| `pyproject.toml` + `poetry.lock` / `uv.lock` | Python |
+| `requirements.txt` | Python (pip) |
+| `Gemfile` + `Gemfile.lock` | Ruby |
+| `composer.json` | PHP |
+| `Package.swift` / `Package.resolved` | Swift |
+Multiple lockfiles may exist (monorepo). Read them all.
+### 2. Extract exact versions (not semver ranges)
+For each relevant dependency in the task scope:
+- Read the **lockfile** for the pinned version (not `package.json`'s range)
+- For Gradle, run `./gradlew dependencies` if needed, or read `gradle.properties`
+- For Go, `go.mod` already pins; verify with `go list -m all`
+- For Maven, effective POM: `mvn help:effective-pom`
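
For the npm case, reading the pinned version from the lockfile rather than the manifest range might look like this sketch. It relies on the `packages` map (keys of the form `node_modules/<name>`) used by `package-lock.json` lockfileVersion 2 and 3, with a fallback to the older `dependencies` map; nested copies such as `node_modules/a/node_modules/b` are ignored, and the function names are hypothetical.

```python
import json
from typing import Optional


def load_lock(path: str = "package-lock.json") -> dict:
    """Parse the lockfile from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def pinned_npm_version(lock: dict, name: str) -> Optional[str]:
    """Return the exact top-level version pinned for `name`, or None if absent."""
    entry = lock.get("packages", {}).get(f"node_modules/{name}")
    if entry is None:
        # Older lockfileVersion 1 layout keeps versions under "dependencies".
        entry = lock.get("dependencies", {}).get(name)
    return entry.get("version") if entry else None
```

A `None` result corresponds to the "version unknown" case that the guardrails ask to flag instead of guessing.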
+### 3. Load ciel-overlay.md
+If present at project root, extract:
+- `## Stack` section — project's declared stack
+- `## Versions` section — URLs to docs
+- Any project-specific rules in `## Règles projet-spécifiques`
+### 4. State assumptions explicitly
+For anything NOT verified from lockfile:
+- "Assuming build tool X because [reason]."
+- "Assuming PostgreSQL is running on default port because [reason]."
+These assumptions must be flagged for `researcher` to verify.
+---
+## Output format
+```
+## AVEC QUOI
+Stack detected:
+- Frontend: <framework> <version> (from <file>)
+- Backend: <framework> <version> (from <file>)
+- Database: <type> <version> (from <file or overlay>)
+- Test: <framework> <version> (from <file>)
+- Build: <tool> <version>
+Overlay:
+- [Loaded: yes/no]
+- [Relevant sections: Stack, Versions, Règles, Leçons]
+Assumptions (NOT from lockfile):
+- <assumption> — <reason>
+Docs URLs (from overlay):
+- <lib>: <url>
+```
+---
+## Guardrails
+- **Never assume a version** — if lockfile is absent, state "version unknown" and flag it
+- **Range vs pinned**: always report the pinned version from the lockfile, not the `^1.2.3` range from the manifest
+- **Monorepo caution**: multiple lockfiles may diverge across packages. Specify which package the version applies to.
+- **Don't guess URLs**: only report doc URLs from the overlay. Let `researcher` agent WebSearch for the rest.
+---
+## How to verify
+- [ ] Versions read from lock files (not package.json ranges)?
+- [ ] ciel-overlay.md consulted for project-specific versions?
+- [ ] Framework detected (React/Vue/Svelte/Ktor/Express/etc)?
+- [ ] Version gaps flagged (installed vs latest)?
+- [ ] Overlay updated if new versions discovered?
+## When triggered
+- Standard/Critical tasks, immediately after `quoi-framer`
+- Before dispatching `researcher` agent (research quality depends on version accuracy)
+- When user asks "what versions are we on?" or the task mentions a specific library

package/assets/skills/workflow/ci-watcher/SKILL.md ADDED

@@ -0,0 +1,194 @@
+---
+name: ci-watcher
+description: Streams GitHub Actions via `gh run watch`, classifies failures as flaky (≥15% fail rate on main → auto-`gh run rerun --failed`) vs real (hand off to debug-reasoning-rca). Invoke after pr-opener, before pr-merger, or on "CI stuck" / "why is CI red" / "flaky test". Inline.
+allowed-tools: Bash, Read
+context: inline
+---
+# ci-watcher — Watch CI, distinguish flaky from broken, retry smart
+`prouver-verifier` takes a single-point snapshot of CI state. `ci-watcher` watches over time: streams the run, waits for completion, classifies failures as flaky vs real, retries only what's safe.
+---
+## Inputs
+```
+BRANCH: [current branch — from git rev-parse]
+PR_NUMBER: [optional — derived if branch has an open PR]
+WORKFLOW: [optional — filter to specific workflow name; default all]
+MODE: [watch | snapshot — default watch; snapshot = single poll + return]
+MAX_RETRIES: [default 1 — for flaky-detected failures only]
+FLAKY_THRESHOLD: [default 15 — % fail rate on main that classifies as flaky]
+```
+### Auto-inference sources
+- **BRANCH** → `git rev-parse --abbrev-ref HEAD`
+- **PR_NUMBER** → `gh pr view --json number --jq .number 2>/dev/null`
+- **WORKFLOW** → all workflows that ran on the branch
+---
+## Preflight
+```bash
+gh auth status 2>&1 | grep -q "Logged in" || exit 1
+BRANCH=${BRANCH:-$(git rev-parse --abbrev-ref HEAD)}
+# Confirm branch has at least one run
+LATEST=$(gh run list --branch="$BRANCH" --limit=1 --json databaseId,status,conclusion --jq '.[0] // empty')
+[ -z "$LATEST" ] && { echo "No runs found for branch $BRANCH — push first"; exit 1; }
+```
+---
+## Process
+### 1. Stream or snapshot
+**Watch mode (default)** — stream until completion:
+```bash
+RUN_ID=$(gh run list --branch="$BRANCH" --limit=1 --json databaseId --jq '.[0].databaseId')
+gh run watch "$RUN_ID" --exit-status
+RESULT=$?
+```
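+`--exit-status` makes the command exit non-zero if the run fails. `gh run watch` refreshes job and step progress until the run completes; it does not stream logs (use `gh run view --log` or `--log-failed` for those).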
+**Snapshot mode** — single poll:
+```bash
+gh run list --branch="$BRANCH" --limit=5 --json databaseId,name,status,conclusion,url
+```
+### 2. On failure — classify flaky vs real
+```bash
+# Failure history of the same workflow on the base branch (last 50 runs at most)
+BASE=$(gh pr view "$PR_NUMBER" --json baseRefName --jq .baseRefName 2>/dev/null || echo "main")
+WORKFLOW=$(gh run view "$RUN_ID" --json workflowName --jq .workflowName)
+# Emits "<total runs><TAB><failed runs>"
+read -r TOTAL FAIL_COUNT < <(gh run list \
+  --branch="$BASE" \
+  --workflow="$WORKFLOW" \
+  --limit=50 \
+  --json conclusion \
+  --jq '[length, ([.[] | select(.conclusion == "failure")] | length)] | @tsv')
+# Divide by the runs actually returned (may be fewer than 50); skip when there is no history
+FAIL_PCT=0
+[ "${TOTAL:-0}" -gt 0 ] && FAIL_PCT=$((FAIL_COUNT * 100 / TOTAL))
+FLAKY_JOBS=(); REAL_FAILURES=()
+# Read failing job names line by line so names containing spaces stay intact
+while IFS= read -r JOB; do
+  [ -z "$JOB" ] && continue
+  if [ "$FAIL_PCT" -ge "${FLAKY_THRESHOLD:-15}" ]; then
+    echo "Job '$JOB' fails ${FAIL_PCT}% of the time on $BASE — CLASSIFIED FLAKY"
+    FLAKY_JOBS+=("$JOB")
+  else
+    echo "Job '$JOB' fails ${FAIL_PCT}% of the time on $BASE — CLASSIFIED REAL FAILURE"
+    REAL_FAILURES+=("$JOB")
+  fi
+done < <(gh run view "$RUN_ID" --json jobs --jq '.jobs[] | select(.conclusion == "failure") | .name')
+```
+**Flaky threshold rationale**: 15% of 50 runs is 7.5, so the threshold trips at 8 failures. Below that, a single failure is likely the PR's fault. Above, it's environmental/test-harness instability.
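
The threshold arithmetic can be checked with a small sketch. It assumes integer percentages as in the shell snippet, and it treats a base branch with no history as a real failure, which is an assumption of this example rather than something the skill specifies.

```python
def classify(failures: int, total: int, threshold_pct: int = 15) -> str:
    """Classify a failure as 'flaky' or 'real' from base-branch history."""
    if total == 0:
        # No history to compare against: assumed here to count as a real failure.
        return "real"
    fail_pct = failures * 100 // total  # integer percentage, like $(( ... )) in bash
    return "flaky" if fail_pct >= threshold_pct else "real"
```

With 50 runs, 7 failures give 14% (real) and 8 failures give 16% (flaky), which is where the "7-8 failures" boundary comes from.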
+### 3. Retry flaky jobs (up to MAX_RETRIES)
+```bash
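+RETRY_COUNT=${RETRY_COUNT:-0}
+# Retry only when every failure is flaky and the retry budget is not exhausted
+if [ ${#FLAKY_JOBS[@]} -gt 0 ] && [ ${#REAL_FAILURES[@]} -eq 0 ] && [ "$RETRY_COUNT" -lt "${MAX_RETRIES:-1}" ]; then
+  echo "Retrying flaky jobs (attempt $((RETRY_COUNT+1))/${MAX_RETRIES:-1})"
+  gh run rerun "$RUN_ID" --failed
+  RETRY_COUNT=$((RETRY_COUNT+1))
+  # Re-enter step 1 (watch the rerun)
+fi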
+```
+`--failed` only reruns failed jobs (saves CI minutes).
+### 4. Extract log excerpt for real failures
+For handoff to `debug-reasoning-rca`:
+```bash
+for JOB in "${REAL_FAILURES[@]}"; do
+  JOB_ID=$(gh run view "$RUN_ID" --json jobs --jq ".jobs[] | select(.name == \"$JOB\") | .databaseId")
+  # Last 50 lines of the failing step
+  gh run view --job="$JOB_ID" --log-failed | tail -50
+done
+```
+### 5. Emit output
+```
+[CI WATCHER]
+Run: <URL>
+Status: <success | failure | in_progress>
+Duration: <Xm Ys>
+Jobs:
+  [OK]   build
+  [OK]   lint
+  [WARN] integration-tests — FLAKY (fails 18% on main, retried — now green)
+  [FAIL] unit-tests — REAL FAILURE (fails 2% on main — investigate)
+Handoff (if real failures):
+  - debug-reasoning-rca with SYMPTOM=<failing test name> + LOG excerpt
+```
+---
+## Guardrails
+- **MAX_RETRIES=1 by default** — a flaky test that fails twice in a row is likely not flaky. Don't spam retries.
+- **Never retry real failures** — the retry mechanism is ONLY for jobs classified flaky. Real failures need a code fix.
+- **Never retry pre-merge checks on main** — only PR branches. Retrying on main risks hiding real regressions.
+- **Budget-aware**: large rerun loops burn CI minutes. Log estimated minutes cost before retry on repos with tight budgets.
+- **Respect timeouts**: `gh run watch` can hang if a job hangs. Wrap in `timeout 1800 gh run watch` for 30-min ceiling.
+- **Flaky classification is per-job, not per-run**: if 3 of 5 jobs are flaky but 1 is real, DO NOT retry — fix the real one first.
+- **Store flaky detections** — append to `.claude/flaky-tests.log` (optional, per-project) so patterns surface across sessions.
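
The retry guardrails above combine into a single decision, sketched here under stated assumptions: the function name, its parameters, and the default base branch `main` are all hypothetical, chosen only to mirror the rules in the list.

```python
def should_retry(
    flaky_jobs: list[str],
    real_failures: list[str],
    retry_count: int,
    max_retries: int = 1,
    branch: str = "",
    base: str = "main",
) -> bool:
    """Decide whether `gh run rerun --failed` is allowed for this run."""
    if branch == base:
        return False  # never retry on the base branch itself
    if real_failures:
        return False  # fix the real failure first, even if other jobs are flaky
    # Retry only when something is flaky and the budget is not used up.
    return bool(flaky_jobs) and retry_count < max_retries
```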
+---
+## When triggered
+- After `pr-opener` in Standard pipeline (step 11 post-insertion)
+- Before `pr-merger` as CI-green verification (replaces inline `gh run list` check)
+- User says: "watch CI", "is CI green?", "CI is flaky", "rerun failed jobs"
+- `prouver-verifier` detects a red CI and needs disambiguation
+---
+## Anti-pattern
+```
+❌ Failed → rerun blindly → rerun → rerun → real bug hidden, minutes wasted
+✅ Failed → classify (fail % on main) → retry only flaky → real fail → debug-reasoning-rca
+```
+```
+❌ sleep 300 && gh run list       # blocked by harness; also cache-cold
+✅ gh run watch --exit-status     # streams, no sleep
+```
+---
+## Handoff
+- **If all green** → `pr-merger` can proceed
+- **If real failure** → `debug-reasoning-rca` via `@ciel-critic` with log excerpt as SYMPTOM + failing job as SCOPE
+- **If flaky detected + retry succeeded** → proceed to `pr-merger`, log flaky for future `/ciel-improve` signal
+- **If flaky + retry failed** → escalate to user (flaky-turned-real or real-misclassified)
+---
+## References
+- `gh run watch` — cli.github.com/manual/gh_run_watch
+- `gh run rerun --failed` — cli.github.com/manual/gh_run_rerun
+- Flaky test classification — Memon et al., "Taming Google-Scale Continuous Testing" (ICSE-SEIP 2017) (15% threshold baseline)
+- Ciel pipeline: pr-opener → ci-watcher → (flaky? retry : debug-reasoning-rca) → pr-merger