npm - @neikyun/ciel - Versions diffs - 6.11.0 → 6.11.2 - Mend

Files changed (39)

package/assets/skills/workflow/relire-critic/SKILL.md ADDED

@@ -0,0 +1,99 @@
+---
+name: relire-critic
+description: How to self-review code effectively — hostile critique methodology, risk taxonomy, and quality checklist. Generates exactly 3 targeted critiques (functional, import/API, data assumption) then resolves each. Applicable after any code change.
+allowed-tools: Read, Grep, Glob, Bash
+---
+# Code Self-Review — Hostile Critique Methodology
+## What this covers
+How to review your own code as if someone else wrote it. Self-review fails because the author reinforces their own blind spots (degeneration of thought, CriticBench 2024). This methodology forces adversarial thinking.
+## Core principle
+Read changed files **as if someone else wrote them**. Your job is to find what could fail, not to confirm what works.
+## Methodology: 3 RISQUES
+Generate EXACTLY 3 specific critiques of the changed code. Not 2, not 5 — 3 forces focus.
+### Mandatory distribution
+Each set of 3 RISQUES must include:
+1. **Functional risk** — what breaks for users? "This fails when..."
+2. **Import/API surface check** — does this import path actually exist? Is the API contract correct?
+3. **Data assumption check** — does this DB column / response shape / format actually match reality?
+### Specificity rules
+- Concrete, not abstract: "might have bugs" is invalid
+- Reference specific `file:line` where the risk lives
+- Can't generate 3 specific critiques → you don't understand the code → read more
+### Format
+```
+RISQUE: [what could fail] parce que [root cause] — IMPACT: [consequence]
+```
+## Resolution
+For each RISQUE, choose ONE:
+- **FIX**: exact correction needed — name the code change
+- **ACCEPT**: why the risk is acceptable (TTL? cosmetic? window < 1s?)
+- **DEFER**: issue reference + why out of scope
+If 0 fixes needed → suspicious. Re-examine for specificity.
+## Quality checklist (8 items)
+Apply after resolving RISQUES:
+1. Quality gates respected? (complexity < 15, nesting < 4, functions < 50 lines)
+2. All new imports exist in actual files at stated paths?
+3. All DB columns referenced exist in real schema?
+4. Test mocks on same host:port as actual requests?
+5. Tests could fail independently of implementation?
+6. Duplicated logic with existing code?
+7. Linter clean? (0 new violations vs base branch)
+8. Would a staff engineer approve this without changes?
+Each item: evidence (`file:line` or command output) or explicit "N/A because X".
+## Output format
+```
+## RISQUES
+1. RISQUE: <X> parce que <Y> — IMPACT: <Z>
+   → FIX/ACCEPT/DEFER: <resolution>
+2. ...
+3. ...
+## CHECKLIST
+- [✓/✗/N/A] <item> — <evidence>
+...
+## VERDICT
+BLOCKING: <list or "none">
+IMPORTANT: <list or "none">
+MINOR: <list or "none">
+```
+## How to verify
+- [ ] Exactly 3 RISQUES (no more, no less)?
+- [ ] Distribution: 1 functional + 1 import + 1 data-assumption?
+- [ ] Each RISQUE has file:line evidence?
+- [ ] Each RISQUE has resolution (FIX/ACCEPT/DEFER)?
+- [ ] Quality checklist (8 items) completed?
+- [ ] VERDICT issued (BLOCKING/IMPORTANT/MINOR)?
+## Common mistakes
+- **Generic critiques**: "might not scale" → too vague. "Loads all users into memory at line 47, O(n)" → specific.
+- **Skipping distribution**: all 3 are functional risks, no import or data check → incomplete.
+- **Too many RISQUES**: 5 critiques dilute focus. Pick top 3 by severity.
+- **Not reading code**: reviewing the description instead of the actual file → always read code first.
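The structural rules of the relire-critic skill above (exactly 3 RISQUES, a `file:line` reference in each, a FIX/ACCEPT/DEFER resolution under each) can be checked mechanically. The following is a minimal sketch and not part of the published package: `check_review` and its regular expressions are hypothetical names, and it assumes each resolution sits on the line directly after its RISQUE, as in the "Output format" template. It cannot judge the functional/import/data distribution, which still needs a human read.

```python
import re

# Patterns follow the "Format" and "Output format" sections of the skill.
# \u2014 is the dash before IMPACT, \u2192 is the arrow before the resolution.
RISQUE_LINE = re.compile(
    r"^\s*\d+\.\s*RISQUE:\s*.+\s+parce que\s+.+\s+\u2014\s+IMPACT:\s*.+$"
)
RESOLUTION_LINE = re.compile(r"^\s*\u2192\s*(FIX|ACCEPT|DEFER):\s*\S")
FILE_LINE_REF = re.compile(r"[\w./-]+\.\w+:\d+")


def check_review(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    lines = text.splitlines()
    risque_idx = [i for i, line in enumerate(lines) if RISQUE_LINE.match(line)]
    if len(risque_idx) != 3:
        problems.append(f"expected exactly 3 RISQUES, found {len(risque_idx)}")
    for n, i in enumerate(risque_idx, start=1):
        if not FILE_LINE_REF.search(lines[i]):
            problems.append(f"RISQUE {n}: no file:line reference")
        following = lines[i + 1] if i + 1 < len(lines) else ""
        if not RESOLUTION_LINE.match(following):
            problems.append(f"RISQUE {n}: missing FIX/ACCEPT/DEFER resolution")
    return problems
```

Running it on a draft review before issuing the VERDICT catches the "too many RISQUES" and "no resolution" mistakes early; vague critiques still pass, so it complements the checklist rather than replacing it.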

package/assets/skills/workflow/security-regression-check/SKILL.md ADDED

@@ -0,0 +1,86 @@
+---
+name: security-regression-check
+description: How to check for security regressions in a diff — greps for new inputs, removed auth blocks, new external calls, new file access, new SQL/eval, and new trust boundaries. Attacker-eye review of what changed, not what was intended.
+allowed-tools: Read, Grep, Bash
+---
+# Security Regression Check — Attacker Eyes on the Diff
+## What this covers
+How to check if a code change introduced security regressions. The assumption "I fixed A without touching B" is NOT a check. Read the diff with attacker eyes: what did my fix add that wasn't there before?
+## Core principle
+**Read `+` lines with attacker eyes, not author eyes.** The author's intent is irrelevant. What can an external actor do with this code path?
+## Process
+### 1. Capture the diff
+```bash
+git diff --unified=3 HEAD
+```
+### 2. Grep for risk signals
+| Signal | What to search | Why it matters |
+|--------|---------------|----------------|
+| New request param reads | `call.parameters[`, `request.body.`, `req.query.`, `req.params.` | New inputs = new validation surface |
+| Removed auth blocks | `-` lines with `authenticate`, `requireAuth`, `verifyToken`, `checkPermission` | Removed auth = privilege escalation |
+| New external calls | `+` lines with `fetch(`, `axios(`, `httpClient.` | New outbound = SSRF / data exfil risk |
+| New file reads/writes | `+` lines with `File(`, `fs.readFile`, `fs.writeFile`, `Path(` | New FS access = path traversal risk |
+| New SQL | `+` lines with SELECT, INSERT, UPDATE, DELETE | New queries = injection risk if built by string concatenation |
+| New eval/exec | `+` lines with `eval(`, `Function(`, `exec(` | Code injection risk |
+| New trust boundaries | `+` lines with cookies, tokens, sessions | New trust = new spoofing surface |
+### 3. Classify each finding
+- **Critical** — must address before merge
+- **Important** — document + address OR accept with rationale
+- **Informational** — note for reflection
+## Output format
+```
+## SECURITY REGRESSION CHECK
+Diff scope: <N files, +X -Y lines>
+### New inputs (from request)
+- <file:line> — <new param> — <has validation?>
+### Removed/modified auth
+- <file:line> — <what changed>
+### New external calls
+- <file:line> — <target | dynamic URL risk>
+### New file/FS access
+- <file:line> — <path controlled by user?>
+### New SQL / eval
+- <file:line> — <parameterized? safe?>
+### New trust boundaries
+- <file:line> — <cookie/token/session change>
+### VERDICT
+- Critical: <list or none>
+- Important: <list or none>
+- Informational: <list or none>
+```
+## How to verify
+- [ ] Diff captured and reviewed?
+- [ ] Risk signals grepped (new inputs, removed auth, external calls, file access, SQL/eval, trust boundaries)?
+- [ ] Each finding classified (Critical / Important / Informational)?
+- [ ] VERDICT issued (Critical / Important / Informational lists, each filled or "none")?
+- [ ] Attacker perspective applied?
+## Key rules
+- **Diff scope matters**: 500-line diff → process in chunks. Fatigue causes misses.
+- **Don't trust commit messages**: "just a refactor" still needs the check. Refactors routinely remove validation.
+- **"No error" ≠ safe**: absence of error messages doesn't mean the change is secure.
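The "Grep for risk signals" table in the security-regression-check skill above maps naturally onto a small diff scanner. The following is a minimal sketch and not part of the published package: `scan_diff`, `SIGNALS` and the signal names are hypothetical, the patterns are transcribed from the table, and the parser assumes `git diff` output with `diff --git` headers. Line numbers refer to the new file; for a removed line, the reported number is where the removal happened.

```python
import re

# One entry per row of the risk-signal table: (diff sign to inspect, pattern).
SIGNALS = {
    "new_input": ("+", re.compile(r"call\.parameters\[|request\.body\.|req\.query\.|req\.params\.")),
    "removed_auth": ("-", re.compile(r"authenticate|requireAuth|verifyToken|checkPermission")),
    "external_call": ("+", re.compile(r"fetch\(|axios\(|httpClient\.")),
    "fs_access": ("+", re.compile(r"File\(|fs\.readFile|fs\.writeFile|Path\(")),
    "sql": ("+", re.compile(r"\b(?:SELECT|INSERT|UPDATE|DELETE)\b")),
    "eval_exec": ("+", re.compile(r"\beval\(|\bFunction\(|\bexec\(")),
    "trust_boundary": ("+", re.compile(r"cookie|token|session", re.IGNORECASE)),
}
HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def scan_diff(diff_text: str) -> list[tuple[str, str, int]]:
    """Return (signal, path, new-file line number) for every matching diff line."""
    findings = []
    path, new_line, in_hunk = "?", 0, False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # a new file section starts; headers follow
            continue
        hunk = HUNK.match(line)
        if hunk:
            new_line, in_hunk = int(hunk.group(1)), True
            continue
        if not in_hunk:
            if line.startswith("+++ "):
                path = line[4:].removeprefix("b/")
            continue
        sign, body = line[:1], line[1:]
        for name, (wanted_sign, pattern) in SIGNALS.items():
            if sign == wanted_sign and pattern.search(body):
                findings.append((name, path, new_line))
        if sign in ("+", " "):
            new_line += 1  # only added and context lines exist in the new file
    return findings
```

Each tuple is a candidate for one bullet in the output format. Classification into Critical, Important or Informational stays a manual step, because a pattern hit says nothing about whether validation or parameterization is present.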

package/assets/skills/workflow/self-consistency-verifier/SKILL.md ADDED

@@ -0,0 +1,85 @@
+---
+name: self-consistency-verifier
+description: How to verify AI-generated code by generating 3 independent solutions, comparing them at syntactic/AST/behavioral levels, and scoring consistency. Divergent solutions indicate model uncertainty — re-prompt with constraints or escalate. Based on IdentityChain (2024) and Consistency-Aided Tested Code Generation (ACM 2025).
+allowed-tools: Read, Grep, Glob, Bash, Write
+---
+# Self-Consistency Verifier — If Three of You Disagree, One of You Is Wrong
+## What this covers
+How to verify AI-generated code by generating 3 diverse solutions and comparing them. A confident LLM that generates 3 semantically identical solutions is probably right. A confident LLM that generates 3 divergent solutions is the dangerous case — it'll ship whichever came out first. Self-consistency is the cheapest high-signal uncertainty estimator available.
+## Core principle
+**Divergence is diagnostic.** When solutions disagree, the disagreement itself tells you what constraint is missing. Don't just pick one — understand WHY they differ.
+## Methodology
+### Generate 3 diverse solutions
+Re-prompt the LLM 3 times with diversifying seeds. The goal is divergent initial approaches, not different variable names.
+**Diversification strategies** (pick 3 out of 5):
+1. **Constraint-reorder** — restate the problem with constraints in a different order
+2. **Language-shift** — ask for pseudocode first, THEN translate to target language
+3. **Test-first** — ask for test cases first, THEN the implementation
+4. **Adversarial framing** — "what would break this naïve solution?" then write the robust version
+5. **Reference implementation** — "find the canonical pattern" then adapt
+### Compare at 3 levels
+**Level A — Syntactic (cheap)**
+- Run formatter, normalize whitespace, compute textual diff
+- Identical after format → consistency HIGH, skip to verdict
+- Differ only in variable names → consistency HIGH
+- Structural diff → proceed to Level B
+**Level B — AST-level (medium)**
+- Parse each solution to AST
+- Compare: function signatures, control flow shape, side-effect surface, data shape flow
+- Score: `consistency = matched_nodes / total_nodes`. ≥0.85 = HIGH, 0.60 to <0.85 = MEDIUM, <0.60 = LOW
+**Level C — Behavioral (expensive, Critical only)**
+- Generate 10-20 property-based test cases (`fast-check` / `hypothesis`)
+- Run each solution against the same test cases
+- All 3 pass all cases → consistency HIGH
+- Divergent pass/fail patterns → at least one is wrong; use majority vote + investigate outlier
+### Interpret divergence
+| Divergence type | Interpretation | Action |
+|---|---|---|
+| One solution handles edge case X, others don't | Missing explicit constraint | Add constraint, re-generate |
+| Solutions use different libraries | Library choice under-specified | Pin the lib, pick one |
+| Solutions use different algorithms with different complexity | Performance under-specified | Add perf constraint |
+| Solutions have different error-handling | Error model under-specified | Specify what errors to surface |
+| Two agree, one is outlier | Outlier is either wrong or has caught something the other two missed | Investigate the outlier first, then use the majority |
+| All three disagree | Problem under-specified or too hard | Escalate to human |
+## Key points
+- **Cost budget**: Critical = full 3-level compare, ≤15 min. Standard = syntactic + AST only, ≤5 min. Trivial = skip entirely
+- **Don't re-generate with the same prompt** — identical prompts produce highly similar outputs; the check becomes trivial. Always diversify
+- **Don't majority-vote blindly** — an outlier that catches an edge case the other two missed is the RIGHT answer. Investigate before voting
+- **AST compare requires a parser** — if the target language lacks easy AST access, fall back to behavioral compare or skip Level B
+- **Three is the magic number** — two is a tie, four is diminishing returns
+## Common anti-patterns
+1. **Same-prompt re-generation**: identical prompts produce near-identical outputs, making the check trivial and useless
+2. **Blind majority voting**: an outlier may be the only one that caught a real edge case — investigate before discarding
+3. **Skipping divergence analysis**: the WHY of divergence is more valuable than the score itself
+4. **Running behavioral tests on every task**: reserve for Critical code only; syntactic + AST is enough for Standard
+## How to verify
+- **Score threshold**: ≥0.85 = HIGH confidence, proceed. 0.60 to <0.85 = MEDIUM, adopt majority + add tests. <0.60 = LOW, re-prompt or escalate
+- **Edge case surfacing**: divergence analysis should produce at least 1 concrete edge case to test
+- **Constraint improvement**: after divergence, the problem statement should have more constraints than before
+## References
+- IdentityChain — openreview.net/forum?id=caW7LdAALh — self-consistency for code LLMs
+- ACM 2025 — "Consistency-Aided Tested Code Generation with LLM" (dl.acm.org/doi/pdf/10.1145/3728902)
+- arxiv 2507.06920 — "Rethinking Verification for LLM Code Generation: From Generation to Testing"
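The self-consistency-verifier skill above gives the Level B formula `consistency = matched_nodes / total_nodes` without saying how nodes are matched. The following is a minimal sketch of one possible reading, not part of the published package and limited to Python sources via the standard `ast` module. The assumptions: nodes are compared by type only (so renamed variables still match), `matched_nodes` is the multiset overlap of node types, `total_nodes` is the size of the larger tree, and the set of three solutions is scored by its weakest pair. All function names are hypothetical.

```python
import ast
from collections import Counter
from itertools import combinations


def node_shape(source: str) -> Counter:
    """Count AST node types, ignoring identifiers and literal values."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def pair_score(a: str, b: str) -> float:
    """matched_nodes / total_nodes for two solutions (assumed definition)."""
    shape_a, shape_b = node_shape(a), node_shape(b)
    matched = sum((shape_a & shape_b).values())  # multiset intersection
    total = max(sum(shape_a.values()), sum(shape_b.values()))
    return matched / total if total else 1.0


def consistency(solutions: list[str]) -> tuple[float, str]:
    """Score a set of solutions by its least consistent pair."""
    score = min(pair_score(a, b) for a, b in combinations(solutions, 2))
    if score >= 0.85:
        level = "HIGH"
    elif score >= 0.60:
        level = "MEDIUM"
    else:
        level = "LOW"
    return score, level
```

Counting node types is deliberately crude: it ignores ordering and nesting, so two differently structured functions can still score high. It is cheap enough for the Standard budget, and anything that scores below HIGH should go through the divergence table rather than being trusted on the number alone.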

package/assets/skills/workflow/spike-mode/SKILL.md ADDED

@@ -0,0 +1,101 @@
+---
+name: spike-mode
+description: How to use SPIKE mode in Ciel v5, a prototype/exploration mode with relaxed gates. Create .ciel/exploration.active to enter spike mode. Quality gates relaxed, code marked FIXME/TODO. Used for POC, draft, experimental, throwaway code. Must be refactored properly after.
+---
+# SPIKE Mode — Explore Without Commitment (Ciel v5)
+## What this covers
+How to use SPIKE mode in Ciel v5 for prototyping and exploration. When you need to test an idea quickly without going through the full quality pipeline. The mode is triggered by creating a `.ciel/exploration.active` file in the project root.
+## Core principle
+**Speed over quality during exploration. Quality over speed for production.** SPIKE mode exists because sometimes you need to write throwaway code to validate an approach. But throwaway code that stays is technical debt.
+## When to use SPIKE mode
+- Prototyping a new feature
+- Testing a library integration
+- Exploring a complex refactoring
+- Validating an architecture approach
+- POC / proof of concept
+- "I don't know if this will work, let me try"
+Do NOT use SPIKE mode for:
+- Production code
+- Code you plan to keep
+- Critical/security code
+- Code you already know how to implement
+## How to enter SPIKE mode
+```bash
+mkdir -p .ciel && touch .ciel/exploration.active
+```
+The plugin detects this file and:
+- Relaxes gates 1 (test-first) and 4 (quality)
+- Injects SPIKE mode indicator in system prompt
+- Marks all code as experimental
+## How to exit SPIKE mode
+```bash
+rm .ciel/exploration.active
+```
+Then, once the exploration is done:
+- Refactor the experimental code properly
+- Add tests
+- Follow the full pipeline
+## What changes in SPIKE mode
+| Gate | Standard mode | SPIKE mode |
+|------|---------------|------------|
+| Test-first (RED) | Blocking | Relaxed |
+| Alternatives | Required | Recommended |
+| Idiomatic | Required | Recommended |
+| Quality (complexity, nesting) | Enforced | Relaxed |
+| Removal safety | Required | Required |
+| Boy-scout | Recommended | Recommended |
+| FIXME/TODO markers | Optional | MANDATORY |
+## Output format
+When in SPIKE mode, add this to the plan:
+```
+## SPIKE MODE
+Goal: <what are we trying to learn/prove?>
+Exit criteria: <when is this exploration done?>
+Markers: <files marked FIXME/TODO>
+Follow-up task: <describe the proper implementation>
+```
+## Common rationalizations
+| Rationalization | Reality |
+|---|---|
+| "I'll clean up the spike code later" | You won't. If you don't schedule the cleanup immediately, spike code becomes permanent debt. |
+| "The gates are annoying, I'll use spike mode" | SPIKE mode is for when you DON'T KNOW the solution, not for when you don't WANT to write tests. |
+| "This is just a quick prototype, no need for FIXME" | Unmarked prototype code looks like production code. Without FIXME, nobody knows it needs refactoring. Future you included. |
+| "SPIKE mode means no rules" | SPIKE mode relaxes the gates but does not remove them. Security and removal safety remain active. |
+## Rules
+- **Code written in SPIKE mode MUST be marked FIXME or TODO**
+- **SPIKE code MUST be refactored or removed after exploration**
+- **SPIKE mode does not bypass security gates** (removal safety still applies)
+- **Do not commit SPIKE code without refactoring**
+- **SPIKE mode is for individual exploration sessions, not for PRs**
+## How to verify
+- [ ] .ciel/exploration.active exists?
+- [ ] All exploratory code has FIXME/TODO markers?
+- [ ] Exit criteria defined?
+- [ ] Follow-up task created for proper implementation?
+- [ ] No SPIKE code committed without refactoring?
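Two items of the spike-mode "How to verify" list above lend themselves to a scripted check: whether `.ciel/exploration.active` exists, and whether every exploratory file carries a FIXME/TODO marker. The following is a minimal sketch and not part of the published package. `spike_active` and `unmarked_files` are hypothetical helpers, and how the plugin itself detects the flag file is not shown in this diff; the caller decides which files count as exploratory.

```python
import re
from pathlib import Path

# The skill requires FIXME or TODO on all code written in SPIKE mode.
MARKER = re.compile(r"\b(?:FIXME|TODO)\b")


def spike_active(project_root: Path) -> bool:
    """True when the flag file described in the skill exists."""
    return (project_root / ".ciel" / "exploration.active").is_file()


def unmarked_files(paths: list[Path]) -> list[Path]:
    """Return the files that contain neither FIXME nor TODO."""
    return [
        path
        for path in paths
        if not MARKER.search(path.read_text(encoding="utf-8", errors="replace"))
    ]
```

A pre-commit hook could refuse the commit when `spike_active` is true, which enforces the rule against committing SPIKE code without refactoring, and print `unmarked_files` for the files touched during the session.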

package/assets/skills/workflow/stride-analyzer/SKILL.md ADDED

@@ -0,0 +1,96 @@
+---
+name: stride-analyzer
+description: How to threat model with STRIDE — 3-pass methodology: risk-rank by mechanical signals, STRIDE 6 categories (Spoofing/Tampering/Repudiation/Info Disclosure/DoS/Elevation) with grep evidence, and killer checklist. For auth, DB schema, payment, security changes.
+allowed-tools: Read, Grep, Glob, Bash
+---
+# STRIDE Threat Modeling — Security Analysis Methodology
+## What this covers
+How to do a security threat model using STRIDE. STRIDE is the framework; grep is the evidence. No theater — every finding needs `file:line` proof.
+## Core principle
+**Anti-theater rule**: every checklist item needs evidence (file:line or grep output). "Checked ✓" with no evidence = not checked.
+## Pass 1: Risk rank (mechanical signals)
+Classify the change:
+- **Critical** if ANY: `auth/`, `security/`, DB tables (users, sessions, tokens), `.executeQuery`, `.executeUpdate`, `userId`, `password`, `token`, `secret`
+- **Important** if ANY: diff > 5 files, `validate`, `sanitize`, `rateLimit`, route handlers
+- **Routine** otherwise
+→ Pass 1 always runs, since it produces the rank. Critical and Important changes continue with passes 2 and 3. Routine changes skip pass 2 and run pass 3 only.
+## Pass 2: STRIDE 6 categories (Critical/Important)
+For each category, answer with grep-backed evidence:
+| Category | Question | Evidence type |
+|----------|----------|--------------|
+| **S**poofing | Can I impersonate someone? | Auth checks, token validation |
+| **T**ampering | Can input be modified in transit? | Input validation, integrity checks |
+| **R**epudiation | Can a user deny this action? | Audit logging, timestamps |
+| **I**nfo Disclosure | What leaks? | Error messages, logs, responses |
+| **D**oS | Can this be flooded/exhausted? | Rate limits, resource bounds |
+| **E**levation | Can I access what I shouldn't? | Authorization checks, role validation |
+Each answer: grep-backed or "N/A because X". **Mark N/A explicitly, never skip silently.**
+**OPS lens** (overlaid on STRIDE): unclosed connections, memory leaks, locks, behavior at 100x volume.
+## Pass 3: Killer checklist (all levels)
+- Same field = same validation everywhere? (grep to verify)
+- Same domain = same auth on ALL transports (REST + WS + SSE)?
+- Identity fields resolved server-side, never client-supplied?
+- SQL parameterized, never interpolated?
+- PII touched = anonymization covered?
+Each item: evidence (`file:line` or grep output) or N/A.
+## Output format
+```
+## STRIDE ANALYSIS
+### Risk rank: <Critical | Important | Routine>
+Signals: <list>
+### STRIDE (if Critical/Important)
+- S (Spoofing): <N/A because X | RISQUE: ... — evidence: file:line>
+- T (Tampering): <...>
+- R (Repudiation): <...>
+- I (Info Disclosure): <...>
+- D (DoS): <...>
+- E (Elevation): <...>
+OPS: <connections | memory | locks | 100x volume>
+### Killer checklist
+- [✓/✗] Same validation everywhere — evidence: <grep output>
+- [✓/✗] Auth parity across transports — evidence: <...>
+- [✓/✗] Identity server-side — evidence: <...>
+- [✓/✗] SQL parameterized — evidence: <...>
+- [✓/✗] PII anonymization — evidence: <...>
+### VERDICT
+BLOCKING: <list or none>
+IMPORTANT: <list or none>
+```
+## How to verify
+- [ ] Pass 1 (Risk rank) completed with mechanical signals?
+- [ ] Pass 2 (STRIDE 6 categories) — all categories have findings or explicit "N/A because X"?
+- [ ] Pass 3 (Killer checklist) completed?
+- [ ] VERDICT issued (BLOCKING / IMPORTANT)?
+- [ ] Evidence format: `file:line` or grep output?
+## Key rules
+- **Don't skip categories silently**: every STRIDE category gets a finding or explicit "N/A because X"
+- **Evidence format**: `path/to/file.ext:123` or `grep -n "pattern" src/` output
+- **Rotate stale items**: if a checklist item catches nothing in 10+ audits, consider replacing it
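Pass 1 of the stride-analyzer skill above is described as mechanical, so it can be approximated with two regular expressions over the changed paths and the diff text. The following is a minimal sketch and not part of the published package: `risk_rank` is a hypothetical function, the patterns are transcribed from the Critical and Important bullets, and the "route handlers" signal is left out because the skill does not say how to detect it.

```python
import re

# Signals transcribed from the "Pass 1: Risk rank" bullets.
CRITICAL = re.compile(
    r"auth/|security/|\b(?:users|sessions|tokens)\b"
    r"|\.executeQuery|\.executeUpdate|userId|password|token|secret"
)
IMPORTANT = re.compile(r"validate|sanitize|rateLimit")


def risk_rank(changed_files: list[str], diff_text: str) -> tuple[str, list[str]]:
    """Return (rank, matched signals) for a change."""
    haystack = "\n".join(changed_files) + "\n" + diff_text
    critical = sorted(set(CRITICAL.findall(haystack)))
    if critical:
        return "Critical", critical
    important = sorted(set(IMPORTANT.findall(haystack)))
    if len(changed_files) > 5:
        important.append("diff > 5 files")
    if important:
        return "Important", important
    return "Routine", []
```

The returned signal list can be pasted into the `Signals:` line of the output format. Substring matching over-triggers on purpose (for example `token` inside `tokenizer`), which errs toward running the extra pass instead of skipping it.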

package/assets/skills/workflow/stride-analyzer/reference.md ADDED

@@ -0,0 +1,144 @@
+# stride-analyzer — Reference
+## STRIDE — detailed category probes
+### S — Spoofing (identity)
+Can I impersonate another user/service/system?
+Probes:
+- Grep for `userId` / `user_id` coming from request params vs resolved server-side (JWT, session)
+- Grep for identity claims trusted without verification (e.g. `X-User-Id` header accepted as-is)
+- Check auth middleware ordering: is authentication before authorization?
+- WebSocket/SSE: is the same auth applied? (common gap: REST auth is bulletproof, WS accepts any token)
+Evidence format:
+```
+- Spoofing: userId extracted from JWT claim at JwtMiddleware.kt:45 — not client-supplied ✓
+```
+### T — Tampering (data integrity)
+Can input be modified in transit or at rest without detection?
+Probes:
+- HTTPS everywhere? Grep for `http://` (non-localhost)
+- CSRF tokens on state-changing endpoints?
+- Signed cookies / signed JWTs? What algorithm? (HS256 vs RS256 considerations)
+- Database writes: is the audit trail immutable? (INSERT-only tables for events)
+### R — Repudiation (non-denial)
+Can a user deny having performed an action?
+Probes:
+- Audit log coverage: what events are logged? With what identity?
+- Log tampering resistance: append-only? Logged externally?
+- Timestamp source: server-controlled? Synced?
+### I — Information Disclosure
+What information leaks to unauthorized parties?
+Probes:
+- Error messages: do they include stack traces / SQL / paths / credentials?
+- Logs: do they contain PII, secrets, tokens?
+- API responses: over-fetching? `SELECT *` instead of projected columns?
+- 404 vs 403 distinction: does the status code or response timing reveal whether a resource exists?
+- Autocomplete endpoints: leak usernames / emails?
+### D — Denial of Service
+Can this be flooded or exhausted?
+Probes:
+- Rate limiting: per-IP? per-user? per-endpoint?
+- Resource bounds: max payload size? max query depth (GraphQL)? max file upload?
+- Algorithmic complexity: O(n²) loops on user-controlled n?
+- Connection pooling: max connections? timeout?
+- Regex catastrophic backtracking on user input?
+### E — Elevation of Privilege
+Can I access what I shouldn't?
+Probes:
+- RBAC/ABAC correctness: does the permission check run before the action?
+- Horizontal privilege escalation: can user A read user B's data with API manipulation?
+- Vertical privilege escalation: can user become admin via some path?
+- Mass assignment: can user set `isAdmin` via PATCH body?
+## OPS lens (overlaid on STRIDE)
+- **Unclosed connections**: grep for `conn.close()` / `client.close()` / `try-with-resources` / `use {}` — every open should have a close
+- **Memory leaks**: long-lived caches without eviction? Unbounded collections? Listeners not removed?
+- **Locks**: deadlock-prone order? Held across I/O?
+- **100x volume**: if traffic grew 100x tomorrow, what breaks first?
+## Killer checklist — detail
+### Same field = same validation everywhere
+If `email` is validated one way in `RegisterRoute.kt` and another way in `ProfileUpdateRoute.kt`, an attacker uses the weaker one. Validation must be centralized.
+```bash
+# Find all places email is validated
+grep -rn "email" --include='*.kt' src/ | grep -iE 'valid|sanitize|check'
+```
+Evidence: all call sites converge on a single validator.
+### Same domain = same auth on ALL transports
+REST endpoint has auth; WebSocket channel for the same resource doesn't (or uses different auth). Attacker bypasses via WebSocket.
+```bash
+grep -rn "authenticate" src/ --include='*.kt'
+grep -rn "socket\|websocket\|sse\|webFluxClient" src/
+```
+### Identity resolved server-side
+```bash
+# Any userId coming from request body/path?
+grep -rn 'call.parameters\["userId"\]' src/
+grep -rn 'request.body.userId' src/
+# Should all be via JWT/session claim
+```
+### SQL parameterized
+```bash
+# Find string interpolation in SQL
+grep -rn "\\\$" src/ --include='*.kt' | grep -iE 'sql|query'
+grep -rn '"SELECT.*" + ' src/
+```
+### PII anonymization
+```bash
+# Find logging of user fields
+grep -rn "logger.info.*user" src/
+grep -rn "println.*email\|println.*phone" src/
+```
+## Multi-PR delegation
+When the same reviewer has done 2+ STRIDE passes on related PRs in one session, blind spots compound. Delegate the 2nd pass to a subagent:
+```
+Task(subagent_type="Explore", prompt="""
+Run STRIDE Pass 2 on this diff. Fresh eyes, no session history.
+CHANGED_FILES: [...]
+FOCUS: category you feel is weakest
+""")
+```
+## Stale item rotation
+Tracked via `learnings-capture`: if a killer checklist item passes (✓) in 10+ audits without catching anything, flag for review. Either:
+- The codebase is genuinely clean on that dimension → consider removing item
+- The item is too vague to fail → tighten the check
+Replace with a newer, more specific check.
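The stale-item rule in the stride-analyzer reference above (a killer checklist item that passes in 10+ audits without catching anything) amounts to tracking a streak per item. The following is a minimal sketch and not part of the published package. It assumes the audit history is available as a list of dictionaries, oldest first, mapping each item to `True` when it passed; the actual storage used by `learnings-capture` is not shown in this diff, and reading "10+ audits" as 10 consecutive passes is also an assumption.

```python
def stale_items(audits: list[dict[str, bool]], threshold: int = 10) -> list[str]:
    """Return items whose current run of consecutive passes reaches the threshold.

    audits: oldest first; each maps checklist item -> True if it passed
    (caught nothing) and False if it flagged a problem.
    """
    streak: dict[str, int] = {}
    for audit in audits:
        for item, passed in audit.items():
            # A failure means the item caught something, so its streak resets.
            streak[item] = streak.get(item, 0) + 1 if passed else 0
    return sorted(item for item, count in streak.items() if count >= threshold)
```

An item returned here is only a candidate for review: the reference leaves the choice between removing it and tightening it to the reviewer.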