@groupby/ai-dev 0.5.1 → 0.5.4

@@ -0,0 +1,243 @@
1
+ ---
2
+ name: council-review
3
+ description: "Multi-model code review council inspired by Karpathy's LLM Council. Spawns 3 sub-agents on different models (Claude Opus 4.8, GPT-5.3 Codex, GPT-5.5) to independently review code changes, then synthesizes and votes on the best comments to produce a unified, high-signal review. Use when the user says /council-review, 'council review', 'multi-model review', 'review council', or 'LLM council'."
4
+ ---
5
+
6
+ # LLM Council Code Review
7
+
8
+ ## Purpose
9
+
10
+ Provide a high-quality, consensus-driven code review by running **three independent
11
+ reviewers on different LLM models**, then synthesizing their findings into a single
12
+ review ranked by agreement and severity — similar to
13
+ [Karpathy's LLM Council](https://github.com/karpathy/llm-council).
14
+
15
+ The insight: different models catch different things. One model may spot a race
16
+ condition another misses; one may flag a security issue the others gloss over.
17
+ Ranking findings by consensus reduces noise and raises signal.
18
+
19
+ ## Models Used (The Council)
20
+
21
+ | Seat | Model ID | Strengths |
22
+ |------------|---------------------|-------------------------------------------------|
23
+ | Reviewer A | `claude-opus-4.8` | Deep reasoning, architecture, subtle logic bugs |
24
+ | Reviewer B | `gpt-5.3-codex` | Code-native, practical fixes, test gaps |
25
+ | Reviewer C | `gpt-5.5` | Broad knowledge, security, API design |
26
+
27
+ ## Trigger
28
+
29
+ Activate this skill when the user says any of:
30
+ - `/council-review`
31
+ - `council review my changes`
32
+ - `multi-model review`
33
+ - `LLM council review`
34
+ - `review council`
35
+
36
+ ## Inputs
37
+
38
+ The user may provide:
39
+ - **No argument** → review local uncommitted changes (staged + unstaged)
40
+ - **`--staged`** → review only staged changes
41
+ - **`--branch`** → review current branch diff vs `origin/main`
42
+ - **`--commits <N>`** → review the last N commits (ignores uncommitted changes)
43
+ - **`--commits <sha>..<sha>`** → review a specific commit range
44
+ - **`--pr <number>`** → review a specific GitHub PR
45
+ - **A file path or glob** → review only those files
46
+
47
+ Natural language also works:
48
+ - "review my last 2 commits" → same as `--commits 2`
49
+ - "review last 3 commits before I open a PR" → same as `--commits 3`
50
+
51
+ ## Workflow
52
+
53
+ ### Phase 0: Repo Discovery
54
+
55
+ Before reviewing any code, discover the **current repo's own rules**. Do not
56
+ carry assumptions from another repo.
57
+
58
+ 1. **Confirm repository scope:**
59
+ - Run `git status --short` and `git branch --show-current`.
60
+ - Identify the repo root and project type.
61
+
62
+ 2. **Discover review configuration:**
63
+ - Check for these files and read them if present:
64
+ - `.github/workflows/claude-pr-review.yml` or `.github/workflows/claude.yml`
65
+ - `.github/workflows/build-pr.yaml`
66
+ - `.github/PULL_REQUEST_TEMPLATE.md`
67
+ - `.github/CODEOWNERS`
68
+ - Run this skill's bundled `scripts/summarize_review_config.py` script for a
69
+ quick context summary — it is repo-agnostic and works on any repository.
70
+ Resolve the script from this skill's own base directory (shown in your skill
71
+ context; do not hard-code a personal/author path) and pass the current repo
72
+ root as its argument:
73
+
74
+ ```bash
75
+ # <skill-dir> = this skill's base directory, e.g.
76
+ # macOS/Linux: ~/.copilot/skills/council-review (or ~/.agents/skills/...)
77
+ # Windows: %USERPROFILE%\.copilot\skills\council-review
78
+ python3 "<skill-dir>/scripts/summarize_review_config.py" . # use `python` on Windows
79
+ ```
80
+
81
+ If Python is unavailable, fall back to reading the config files above manually.
82
+
83
+ 3. **Discover repo guidance (read if present):**
84
+ - `.github/copilot-instructions.md`
85
+ - `CLAUDE.md`, `AGENTS.md`
86
+ - `docs/conventions.md`, `docs/project-rule.md`, `docs/source-control.md`
87
+ - `README.md` (skim for architecture/setup sections)
88
+
89
+ 4. **Detect technology profile:**
90
+ Scan build files and guidance docs for technology markers. Apply the
91
+ matching profile from `references/technology-profiles.md`. Only apply
92
+ rules that the **current repo actually uses**.
93
+
94
+ Key markers to scan for:
95
+ - Java/Gradle: `build.gradle`, `build.gradle.kts`, `settings.gradle`
96
+ - Maven: `pom.xml`
97
+ - Go: `go.mod`, `Makefile`
98
+ - Python: `pyproject.toml`, `requirements*.txt`
99
+ - Node: `package.json`
100
+
101
+ 5. **Build a PROJECT_CONTEXT block** from all discovered information. This
102
+ block will be injected into every reviewer's prompt so all three models
103
+ review against the same repo-specific rules.
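The marker scan in step 4 can be sketched as below. This is a minimal illustration, not part of the skill: the `MARKERS` table copies the file names listed above, while the function name `detect_profiles` and the profile labels are hypothetical, and only the repo root is checked.

```python
from pathlib import Path

# Marker files copied from the list in step 4. The labels are illustrative only.
MARKERS = {
    "Java/Gradle": ["build.gradle", "build.gradle.kts", "settings.gradle"],
    "Maven": ["pom.xml"],
    "Go": ["go.mod", "Makefile"],
    "Python": ["pyproject.toml", "requirements*.txt"],
    "Node": ["package.json"],
}


def detect_profiles(repo_root: str) -> list[str]:
    """Return the labels whose marker files exist directly under repo_root."""
    root = Path(repo_root)
    found = []
    for label, patterns in MARKERS.items():
        # Path.glob accepts both literal names and wildcard patterns.
        if any(any(root.glob(pattern)) for pattern in patterns):
            found.append(label)
    return found
```

A `Makefile` on its own is a weak signal, so a real scan would probably confirm a match against the guidance docs before applying a profile.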
104
+
105
+ ### Phase 1: Gather the Diff
106
+
107
+ 1. Determine the review scope based on user input:
108
+ - **Local changes (default):** If working tree has edits, use `git diff HEAD`
109
+ (includes staged + unstaged). If working tree is clean but branch has
110
+ commits, compare against the PR base: `git diff origin/main...HEAD`.
111
+ If `origin/main` is not available, inspect upstream and available remotes.
112
+ - **Staged only:** `git diff --cached`
113
+ - **Branch diff:** `git diff origin/main...HEAD`
114
+ - **Last N commits:** `git diff HEAD~N..HEAD` (ignores working tree entirely)
115
+ Example: `--commits 2` → `git diff HEAD~2..HEAD`
116
+ - **Commit range:** `git diff <sha1>..<sha2>` for explicit ranges
117
+ - **PR:** `gh pr diff <number>`
118
+ 2. Also gather context:
119
+ - `git diff --stat` for the file change summary
120
+ - The PROJECT_CONTEXT block built in Phase 0
121
+ 3. If the diff is empty, tell the user and stop.
122
+ 4. If the diff is very large (>5000 lines), warn the user and suggest narrowing scope.
123
+ 5. **Classify the change** (helps reviewers focus):
124
+ - API/controller, service/orchestration, repository/database, search engine,
125
+ Mongo query/indexing, cache, Pub/Sub/messaging, auth/security, feature flags,
126
+ docs, tests, build/dependency, deployment, or tooling.
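The scope-to-command mapping in step 1 and the size checks in steps 3 and 4 can be sketched as follows. The function names `diff_command` and `size_check` and the scope labels are hypothetical. The sketch does not cover the clean-working-tree fallback described for the default case.

```python
from typing import Optional


def diff_command(scope: str, value: Optional[str] = None,
                 base: str = "origin/main") -> list[str]:
    """Map a review scope to the command listed in Phase 1."""
    if scope == "local":
        return ["git", "diff", "HEAD"]
    if scope == "staged":
        return ["git", "diff", "--cached"]
    if scope == "branch":
        return ["git", "diff", f"{base}...HEAD"]
    if scope == "commits":
        if value is None:
            raise ValueError("--commits needs a count or a range")
        if ".." in value:
            # Explicit range such as <sha>..<sha>
            return ["git", "diff", value]
        count = int(value)
        if count < 1:
            raise ValueError("commit count must be positive")
        return ["git", "diff", f"HEAD~{count}..HEAD"]
    if scope == "pr":
        if value is None:
            raise ValueError("--pr needs a PR number")
        return ["gh", "pr", "diff", value]
    raise ValueError(f"unknown scope: {scope}")


def size_check(diff_text: str, limit: int = 5000) -> str:
    """Classify a diff as 'empty', 'large' (more than limit lines), or 'ok'."""
    if not diff_text.strip():
        return "empty"
    if len(diff_text.splitlines()) > limit:
        return "large"
    return "ok"
```

Returning the command as an argument list rather than a string keeps it safe to pass to `subprocess.run` without shell quoting.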
127
+
128
+ ### Phase 2: Deploy the Council (Parallel Sub-Agents)
129
+
130
+ Launch **exactly 3 `code-review` agents in parallel** using the `task` tool, each
131
+ with a different `model` parameter. All three receive the **identical prompt** so
132
+ their reviews are directly comparable.
133
+
134
+ **CRITICAL: Launch all 3 in a single response — they run in parallel.**
135
+
136
+ Each agent receives the prompt from `references/reviewer-prompt.md`, with the
137
+ diff and project context injected.
138
+
139
+ ```
140
+ Agent A: task(agent_type="code-review", model="claude-opus-4.8", ...)
141
+ Agent B: task(agent_type="code-review", model="gpt-5.3-codex", ...)
142
+ Agent C: task(agent_type="code-review", model="gpt-5.5", ...)
143
+ ```
144
+
145
+ All three agents run in `mode="background"`. Wait for all three to complete
146
+ before proceeding to Phase 3.
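As an analogy only, the launch-and-wait pattern looks like the sketch below. The skill itself dispatches agents through the `task` tool, not Python threads. `run_reviewer` is a hypothetical callable standing in for one `task(...)` call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

COUNCIL = ["claude-opus-4.8", "gpt-5.3-codex", "gpt-5.5"]


def run_council(prompt: str,
                run_reviewer: Callable[[str, str], str],
                models: list[str] = COUNCIL) -> tuple[dict, dict]:
    """Send the identical prompt to every model at once and wait for all of them.

    Returns (reviews, failures), both keyed by model ID.
    """
    reviews, failures = {}, {}
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # Submit everything before reading any result, so the calls overlap.
        futures = {m: pool.submit(run_reviewer, m, prompt) for m in models}
        for model, future in futures.items():
            try:
                reviews[model] = future.result()
            except Exception as exc:  # one failed seat must not sink the council
                failures[model] = str(exc)
    return reviews, failures
```

Collecting failures separately, instead of raising, leaves the caller free to apply the fallback rules under Error Handling.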
147
+
148
+ ### Phase 3: Collect & Parse Reviews
149
+
150
+ Read all three agent results. Each agent returns findings in the structured
151
+ format defined in `references/reviewer-prompt.md`. Extract:
152
+ - File path and line range for each comment
153
+ - Severity (P1/P2/P3)
154
+ - Category (bug, security, performance, design, test-gap, error-handling, concurrency, data-integrity)
155
+ - The finding description and suggested fix
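A minimal parsing sketch follows. It assumes reviewers follow the `### Finding <N>` layout from `references/reviewer-prompt.md` exactly. The function name `parse_findings` and the lower-cased dictionary keys are hypothetical.

```python
import re

# Matches lines such as "- **Severity:** P1"
FIELD = re.compile(r"^- \*\*(\w+):\*\*\s*(.*)$")
# Matches the "<start>-<end>" part of the Lines field
SPAN = re.compile(r"(\d+)\s*-\s*(\d+)")


def parse_findings(review_text: str) -> list[dict]:
    """Turn one reviewer's output into a list of finding dictionaries."""
    findings = []
    current = None
    for raw in review_text.splitlines():
        line = raw.strip()
        if line.startswith("### Finding"):
            current = {}
            findings.append(current)
            continue
        match = FIELD.match(line)
        if current is not None and match:
            current[match.group(1).lower()] = match.group(2).strip()
    for finding in findings:
        span = SPAN.search(finding.get("lines", ""))
        if span:
            finding["start"] = int(span.group(1))
            finding["end"] = int(span.group(2))
    return findings
```

A review that contains only the `### No Issues Found` block yields an empty list, because no line starts with `### Finding`.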
156
+
157
+ ### Phase 4: Council Vote — Synthesize & Rank
158
+
159
+ This is the core "council" step. Process the three reviews:
160
+
161
+ #### 4a. Deduplicate
162
+
163
+ Group comments that refer to the **same issue** (same file, overlapping lines,
164
+ same root cause). Two comments are "the same issue" if they:
165
+ - Point to the same file and overlapping line range, AND
166
+ - Describe the same underlying problem (even in different words)
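The first condition is mechanical and can be sketched as below. The second (same underlying problem) needs judgment and is not covered here. The function names and the `file`/`start`/`end` keys are hypothetical.

```python
def ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two inclusive line ranges share at least one line."""
    return a_start <= b_end and b_start <= a_end


def candidate_duplicates(a: dict, b: dict) -> bool:
    """Same file and overlapping lines: a candidate pair, not yet a confirmed duplicate."""
    return a["file"] == b["file"] and ranges_overlap(
        a["start"], a["end"], b["start"], b["end"]
    )
```

Pairs that pass this check still need the root-cause comparison before they are merged into one issue.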
167
+
168
+ #### 4b. Score by Agreement
169
+
170
+ For each unique issue, count how many of the 3 reviewers flagged it:
171
+
172
+ | Agreement | Label | Weight |
173
+ |-----------|--------------|--------|
174
+ | 3/3 | 🟢 Unanimous | High |
175
+ | 2/3 | 🟡 Majority | Medium |
176
+ | 1/3 | 🔵 Solo | Low |
177
+
178
+ #### 4c. Rank
179
+
180
+ Sort the final list by:
181
+ 1. **Agreement** (unanimous > majority > solo)
182
+ 2. **Severity** (P1 > P2 > P3) within each agreement tier
183
+ 3. For each deduplicated issue, keep the clearest, most actionable version of
184
+ the comment (pick the best phrasing from whichever model wrote it)
185
+
186
+ #### 4d. Solo Comment Filter
187
+
188
+ Solo comments (1/3) are **not discarded** but are presented separately under
189
+ a "Minority Opinions" section. They may contain genuine catches the other
190
+ models missed, or they may be noise. Let the user decide.
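Steps 4b to 4d can be sketched as below, assuming each deduplicated issue carries a `models` list and a `severities` list with one entry per flagging reviewer (both keys are hypothetical). The sketch also applies the severity consistency rule from Hard Rules by taking the highest severity any reviewer assigned.

```python
SEVERITY_ORDER = {"P1": 0, "P2": 1, "P3": 2}


def rank_issues(issues: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (consensus, minority), each sorted by agreement and then severity."""
    scored = []
    for issue in issues:
        scored.append({
            **issue,
            "agreement": len(set(issue["models"])),
            # Highest severity wins when reviewers disagree (P1 > P2 > P3).
            "severity": min(issue["severities"], key=SEVERITY_ORDER.__getitem__),
        })
    ordered = sorted(
        scored,
        key=lambda item: (-item["agreement"], SEVERITY_ORDER[item["severity"]]),
    )
    consensus = [item for item in ordered if item["agreement"] >= 2]
    minority = [item for item in ordered if item["agreement"] == 1]
    return consensus, minority
```

The model names never enter the sort key, so no reviewer's opinion is weighted above another's.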
191
+
192
+ ### Phase 5: Present the Council Review
193
+
194
+ Output the review using the format in `references/output-format.md`.
195
+
196
+ ## Hard Rules
197
+
198
+ - **Identical prompts.** All three reviewers get exactly the same input.
199
+ Do not customize prompts per model — the whole point is fair comparison.
200
+ - **No model bias.** Do not weight one model's opinion over another during
201
+ voting. Rank only by agreement count and severity, never by which model raised the finding.
202
+ - **Parallel launch.** Always launch all 3 agents in a single response.
203
+ Never run them sequentially.
204
+ - **Transparency.** Always show which models agreed on each finding.
205
+ - **No hallucinated code.** Do not generate suggested replacement code
206
+ yourself during synthesis. Use the reviewers' suggestions as-is.
207
+ - **Severity consistency.** If reviewers disagree on severity for the same
208
+ issue, use the highest severity any reviewer assigned.
209
+ - **Signal over noise.** The council exists to reduce noise. If a comment
210
+ is unclear or contradictory across reviewers, note the disagreement rather
211
+ than forcing consensus.
212
+
213
+ ## Configuration
214
+
215
+ The user can customize the council by telling the agent:
216
+ - Different models: "use Opus 4.5 instead of Opus 4.8"
217
+ - Different number of reviewers: "use 5 models" (but default is 3)
218
+ - Focus areas: "focus on security" or "focus on performance"
219
+ - Strictness: "be strict" (lower the noise threshold) or "only critical" (P1 only)
220
+
221
+ ## Error Handling
222
+
223
+ - If one agent fails, proceed with the remaining 2. Note the failure.
224
+ - If two agents fail, fall back to a single-model review and explain.
225
+ - If all three fail, tell the user and suggest running a simple code-review instead.
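For the default three-seat council, these fallbacks reduce to a lookup on the number of reviewers that returned a result. The function name and the mode labels below are hypothetical.

```python
def fallback_mode(successes: int) -> str:
    """Pick the review mode from how many of the three reviewers succeeded."""
    if successes >= 3:
        return "full-council"
    if successes == 2:
        return "reduced-council"   # proceed with 2 and note the failure
    if successes == 1:
        return "single-model"      # fall back and explain why
    return "abort"                 # suggest a simple code-review instead
```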
226
+
227
+ ## Phase 6 (Optional): Post-Review Verification
228
+
229
+ After presenting the council review, **offer** to run verification. Do not
230
+ run automatically — the user may just want the review.
231
+
232
+ If the user accepts:
233
+
234
+ 1. **Run targeted tests** for changed files using the repo's test command
235
+ (discovered in Phase 0). Prefer the narrowest test scope first.
236
+ 2. **Run the PR build command** when feasible (from `build-pr.yaml`).
237
+ 3. **Run `git diff --check`** for whitespace issues.
238
+ 4. **Check PR template compliance** — if the repo has a PR template with
239
+ checkboxes, note which items are affected by the change.
240
+ 5. **Report CODEOWNERS** — if the repo has CODEOWNERS, note which owners
241
+ are relevant for the changed files.
242
+
243
+ Append verification results to the review output.
@@ -0,0 +1,108 @@
1
+ # Council Review Output Format
2
+
3
+ Use this format when presenting the synthesized council review to the user.
4
+
5
+ ---
6
+
7
+ ## Header
8
+
9
+ ```
10
+ # 🏛️ LLM Council Code Review
11
+
12
+ **Scope:** <description of what was reviewed — branch, PR #, local changes>
13
+ **Council:** Claude Opus 4.8 · GPT-5.3 Codex · GPT-5.5
14
+ **Date:** <current date>
15
+ **Verdict:** <PASS | PASS WITH COMMENTS | NEEDS CHANGES>
16
+ ```
17
+
18
+ ### Verdict Rules
19
+ - **PASS** — No P1 or P2 issues found by any reviewer
20
+ - **PASS WITH COMMENTS** — No P1 issues; some P2/P3 found
21
+ - **NEEDS CHANGES** — At least one P1 issue found, OR 3+ P2 issues with majority agreement
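One possible reading of these rules is sketched below. Three points are assumptions, not stated in the rules: NEEDS CHANGES is checked first, "majority agreement" means at least 2 of 3 reviewers, and a review with only P3 findings counts as PASS. The `severity` and `agreement` keys are hypothetical.

```python
def verdict(issues: list[dict]) -> str:
    """Derive the verdict from deduplicated issues with severity and agreement."""
    has_p1 = any(i["severity"] == "P1" for i in issues)
    p2_issues = [i for i in issues if i["severity"] == "P2"]
    p2_with_majority = [i for i in p2_issues if i["agreement"] >= 2]

    # Assumption: the blocking rule takes precedence over the other two.
    if has_p1 or len(p2_with_majority) >= 3:
        return "NEEDS CHANGES"
    # Assumption: P3-only reviews fall under PASS (no P1 or P2 present).
    if not p2_issues:
        return "PASS"
    return "PASS WITH COMMENTS"
```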
22
+
23
+ ---
24
+
25
+ ## Consensus Findings (2/3 or 3/3 agreement)
26
+
27
+ These are issues flagged by multiple models independently. High confidence.
28
+
29
+ For each finding:
30
+
31
+ ```
32
+ ### <N>. <One-line summary>
33
+ 🟢 Unanimous (3/3) | 🟡 Majority (2/3)
34
+ **Severity:** P1 | P2 | P3
35
+ **Category:** <category>
36
+ **File:** `<path>` (lines ~<range>)
37
+ **Agreed by:** Opus 4.8 ✓ · Codex 5.3 ✓ · GPT-5.5 ✓
38
+
39
+ <Best description from the reviewers. Pick the clearest, most actionable version.>
40
+
41
+ **Suggested fix:**
42
+ <Most concrete suggestion from any reviewer.>
43
+ ```
44
+
45
+ ---
46
+
47
+ ## Minority Opinions (1/3 — solo catches)
48
+
49
+ These were flagged by only one model. They may be genuine catches the others
50
+ missed, or false positives. Included for completeness.
51
+
52
+ For each:
53
+
54
+ ```
55
+ ### <N>. <One-line summary>
56
+ 🔵 Solo — flagged by <model name> only
57
+ **Severity:** P1 | P2 | P3
58
+ **Category:** <category>
59
+ **File:** `<path>` (lines ~<range>)
60
+
61
+ <Description from the flagging model.>
62
+
63
+ **Suggested fix:**
64
+ <Suggestion if provided.>
65
+ ```
66
+
67
+ ---
68
+
69
+ ## Review Statistics
70
+
71
+ ```
72
+ | Metric | Value |
73
+ |---------------------------|-------|
74
+ | Total unique issues | <N> |
75
+ | Unanimous (3/3) | <N> |
76
+ | Majority (2/3) | <N> |
77
+ | Solo (1/3) | <N> |
78
+ | P1 (Critical) | <N> |
79
+ | P2 (Important) | <N> |
80
+ | P3 (Minor) | <N> |
81
+ | Files reviewed | <N> |
82
+ | Lines changed | +<N> / -<N> |
83
+ ```
84
+
85
+ ---
86
+
87
+ ## Model Agreement Matrix (optional, for large reviews)
88
+
89
+ Show which model caught what. Only include for reviews with 5+ findings.
90
+
91
+ ```
92
+ | # | Finding | Opus 4.8 | Codex 5.3 | GPT-5.5 |
93
+ |---|--------------------------------|----------|-----------|---------|
94
+ | 1 | Race condition in UserService | ✓ | ✓ | ✓ |
95
+ | 2 | Missing null check in parser | ✓ | ✓ | |
96
+ | 3 | SQL injection in search filter | | ✓ | ✓ |
97
+ | 4 | Unused import (solo) | ✓ | | |
98
+ ```
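A table in this shape can be generated from the ranked findings, as in the sketch below. It assumes each finding has a `summary` and a `models` collection (hypothetical keys). The `min_findings` parameter mirrors the 5+ findings rule above.

```python
def agreement_matrix(findings: list[dict], models: list[str],
                     min_findings: int = 5) -> str:
    """Render the agreement matrix, or an empty string for small reviews."""
    if len(findings) < min_findings:
        return ""
    rows = [
        "| # | Finding | " + " | ".join(models) + " |",
        "|---|---|" + "---|" * len(models),
    ]
    for number, finding in enumerate(findings, start=1):
        marks = " | ".join(
            "✓" if model in finding["models"] else " " for model in models
        )
        rows.append(f"| {number} | {finding['summary']} | {marks} |")
    return "\n".join(rows)
```

Column padding is left to the Markdown renderer, so the output is valid but not aligned like the hand-written example.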
99
+
100
+ ---
101
+
102
+ ## Footer
103
+
104
+ ```
105
+ ---
106
+ *Review generated by LLM Council · 3 independent models · consensus-ranked*
107
+ *Models may miss issues. This review supplements, not replaces, human judgment.*
108
+ ```
@@ -0,0 +1,99 @@
1
+ # Reviewer Prompt Template
2
+
3
+ You are one member of a 3-model code review council. Your job is to independently
4
+ review the code changes below and produce high-signal findings. Another process
5
+ will compare your review against two other models' reviews to find consensus.
6
+
7
+ ## Your Review Constraints
8
+
9
+ - **Only flag things that genuinely matter.** Bugs, security issues, logic errors,
10
+ performance problems, missing error handling, test gaps for changed code.
11
+ - **Never comment on style, formatting, naming preferences, or trivial matters**
12
+ unless they cause a real problem (e.g., a misleading variable name that could
13
+ cause a bug).
14
+ - **Be specific.** Always include the file path, approximate line range, and a
15
+ concrete description of the problem.
16
+ - **Suggest a fix** when possible. Don't just say "this is wrong" — say what to do.
17
+ - **Don't be redundant.** If two issues share the same root cause, report it once.
18
+ - **Respect repo-specific rules.** The project context below includes this repo's
19
+ own conventions, technology profile, and CI expectations. Review against THOSE
20
+ rules, not generic best practices. Do not assume patterns from other repos.
21
+
22
+ ## Project Context
23
+
24
+ {PROJECT_CONTEXT}
25
+
26
+ This context was discovered from the repo's own guidance files, CI workflows,
27
+ build configuration, and technology markers. If the context mentions specific
28
+ patterns (e.g., Micronaut DI, tenant isolation, cache key format), verify the
29
+ diff follows them. If the context is silent on something, don't invent rules.
30
+
31
+ ## Technology Profile
32
+
33
+ {TECHNOLOGY_PROFILE}
34
+
35
+ ## Change Classification
36
+
37
+ {CHANGE_CLASSIFICATION}
38
+
39
+ ## Review Focus Areas (from repo's CI/review config)
40
+
41
+ Review against these standard areas, but weight them based on the change
42
+ classification above:
43
+
44
+ 1. **Code quality:** single responsibility, clarity, maintainability, unnecessary
45
+ complexity/nesting, redundant abstractions, local style conventions.
46
+ 2. **Security:** auth, authorization, tenant isolation, input validation, secrets,
47
+ sensitive data exposure.
48
+ 3. **Performance:** database/query shape, cache behavior, external calls,
49
+ async/blocking boundaries, memory/resource lifecycle.
50
+ 4. **Testing:** adequate coverage for changed code, edge cases, missing scenarios.
51
+ 5. **Documentation:** README/docs/OpenAPI/API docs accuracy when behavior changes.
52
+
53
+ ## Changed Files Summary
54
+
55
+ {DIFF_STAT}
56
+
57
+ ## Full Diff
58
+
59
+ {DIFF}
60
+
61
+ ## Output Format
62
+
63
+ Return your findings as a structured list. Each finding must follow this exact format:
64
+
65
+ ```
66
+ ### Finding <N>
67
+ - **File:** <path/to/file>
68
+ - **Lines:** <start>-<end> (approximate)
69
+ - **Severity:** P1 | P2 | P3
70
+ - **Category:** bug | security | performance | design | test-gap | error-handling | concurrency | data-integrity
71
+ - **Summary:** <one-line summary>
72
+ - **Details:** <1-3 sentences explaining the issue>
73
+ - **Suggestion:** <concrete fix or action>
74
+ ```
75
+
76
+ ### Severity Guide
77
+
78
+ - **P1 — Critical:** Likely bug, security vulnerability, data loss, crash, race condition,
79
+ or production incident. Must fix before merge.
80
+ - **P2 — Important:** Behavior regression, missing important test, incorrect error handling,
81
+ performance issue under realistic load. Should fix before merge.
82
+ - **P3 — Minor:** Low-risk test gap, minor inefficiency, documentation inaccuracy.
83
+ Nice to fix but not blocking.
84
+
85
+ ### What NOT to Report
86
+
87
+ - Style or formatting preferences
88
+ - "Consider renaming X" suggestions
89
+ - "Add a comment explaining Y" suggestions
90
+ - Import ordering
91
+ - Trailing whitespace
92
+ - Suggestions that don't prevent a real problem
93
+
94
+ If you find zero genuine issues, return:
95
+
96
+ ```
97
+ ### No Issues Found
98
+ The changes look correct. No bugs, security issues, or significant concerns identified.
99
+ ```
@@ -0,0 +1,54 @@
1
+ # Technology Profiles
2
+
3
+ Apply a profile **only** when discovered in the current repo's files.
4
+ Do not carry assumptions from one repo to another.
5
+
6
+ ## Java / Micronaut
7
+
8
+ - Check the Java version from workflow and Gradle config (do not assume 21 — some repos use 17).
9
+ - Use Micronaut compile-time DI patterns; avoid Spring Boot assumptions unless the repo is Spring-based.
10
+ - Prefer constructor injection and the Lombok annotations already used locally.
11
+ - Follow the repo's `var` rule (many Rezolve Java repos use `var` for new variables created with `new`).
12
+ - Check `@Singleton` vs `@Context` scope, `@Named` qualifiers, `@ExecuteOn` boundaries.
13
+ - Verify blocking operations are not on the event loop thread.
14
+
15
+ ## JOOQ / Flyway / Database
16
+
17
+ - Migration names and generated classes matter — check naming conventions.
18
+ - Check tenant isolation on queries and mutation side effects (events, audit logs).
19
+ - Verify transaction boundaries and connection management.
20
+
21
+ ## Search Services
22
+
23
+ - Check strategy and engine selection order.
24
+ - Ensure request builders, filters, refinements, biasing, pagination, and response builders preserve behavior.
25
+ - For Google Retail: check proto conversion, request fields, fallback behavior.
26
+ - For Mongo Atlas Search: check aggregation stages, index assumptions, field paths, unsupported Google-only features.
27
+ - For Redis caches: check key composition, tenant/collection/area isolation, TTLs, skip-cache paths.
28
+
29
+ ## Mongo Data / Indexing
30
+
31
+ - Check collection and tenant scoping.
32
+ - Check aggregation pipeline correctness, projections, variant/inventory handling.
33
+ - Check index definition generation, conditional indexing, feature flags.
34
+
35
+ ## Authentication / Security
36
+
37
+ - Check token validation, claims, expiration, signature algorithms, public endpoint boundaries.
38
+ - Check Redis key storage, key rotation, secret handling.
39
+ - Check Pub/Sub credential update idempotency.
40
+
41
+ ## Go / Python / Node
42
+
43
+ - Prefer repo-provided commands from Makefile, package files, workflow files, or docs.
44
+ - Keep tests close to the changed package/module.
45
+ - Check schema/serialization compatibility and environment variable handling.
46
+ - Do not import Java/Micronaut assumptions into these repos.
47
+
48
+ ## General (all profiles)
49
+
50
+ - **Code quality:** Single responsibility, clarity, maintainability, unnecessary complexity.
51
+ - **Security:** Auth, authorization, tenant isolation, input validation, secrets.
52
+ - **Performance:** Database/query shape, cache behavior, external calls, async/blocking.
53
+ - **Testing:** Adequate coverage for changed code, edge cases, no superfluous tests.
54
+ - **Documentation:** README/docs/OpenAPI accuracy when behavior changes.