@in-the-loop-labs/pair-review 2.0.2 → 2.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,260 @@
---
name: review-model-guidance
description: Guidance for selecting models when performing code review with subtasks. Load this skill to enable intelligent model selection for review analysis — choosing faster models for simple tasks and deeper reasoning models for complex analysis.
---

# Review Model Guidance

When performing code review analysis, you can switch models at any time to match
the demands of the task. This applies both to your own direct work and to subtasks
you delegate via the `task` tool. Use `/model` to switch your own model, or pass
the `model` parameter when delegating subtasks.

Before switching models, check which models are actually available by running
`${PI_CMD:-pi} --list-models` via bash. This lists every model with a valid API
key configured, in provider and model columns. Specify models as
`provider/model`. Only switch to models that appear in that list.

## Forbidden Models

The following models are **never** permitted due to extreme cost:

- **openai/o3-pro** — Prohibitively expensive. Use `openai/o3` instead, which
  provides strong reasoning at a fraction of the cost.

Do not select these models for your own work, for subtasks, or in any model
parameter. If a forbidden model is the only reasoning model available, fall back
to a balanced model with extended thinking enabled instead.
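
The fallback logic above can be sketched as a small guard. This is an illustrative helper, not part of the `task` tool; the names are hypothetical, while the forbidden list, the `openai/o3` substitution, and the balanced fallback all come from this section:

```javascript
// Illustrative guard for the rules above; function and constant names are hypothetical.
const FORBIDDEN = { "openai/o3-pro": "openai/o3" }; // forbidden model -> documented substitute
const BALANCED_FALLBACK = "anthropic/claude-sonnet-4-5"; // use with extended thinking

function enforceModelPolicy(requested, availableModels) {
  // Swap a forbidden model for its cheaper documented substitute.
  const candidate = FORBIDDEN[requested] ?? requested;
  // Only switch to models that actually appear in `--list-models` output.
  return availableModels.includes(candidate) ? candidate : BALANCED_FALLBACK;
}
```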

## When to Switch Models

Not every task benefits from model switching. Use these heuristics:

- **Same model is fine** when you're doing a single focused task, or when the diff
  is small enough that model choice won't materially affect quality.
- **Switch models** when you're delegating subtasks with clearly different
  complexity levels, when you're reviewing different aspects of the changes or
  reviewing from different perspectives, or when you want a second opinion from a
  different model family on critical findings.

## Model Selection by Task Type

### Balanced Models

Use mid-tier models for general code review work:

- **File-context analysis**: Understanding how changes fit within a file's
  existing patterns, checking for consistency with surrounding code
- **API contract review**: Verifying that function signatures, types, and
  interfaces are used correctly
- **Test coverage assessment**: Evaluating whether test changes match code changes
- **Most general code review**: The default choice when you don't have a strong
  reason to go deeper
- **Small to medium diffs**: Where the review itself is straightforward

Good balanced choices: `anthropic/claude-sonnet-4-5`, `google/gemini-2.5-pro`, `openai/gpt-5`

### Deep Reasoning Models

For complex analysis, enable extended thinking or switch to a reasoning model.
Here are the higher-end models and their strengths for code review:

**Anthropic**
- **anthropic/claude-opus-4-6**: Anthropic's most capable model. Strongest at nuanced
  architectural reasoning, understanding implicit design patterns, and catching
  subtle issues that require deep understanding of intent. Excellent at security
  review and explaining *why* something is problematic, not just *that* it is.
- **anthropic/claude-opus-4-5**: Previous-generation flagship. Still very strong for complex
  review tasks. Good alternative when opus-4-6 is unavailable.
- **anthropic/claude-sonnet-4-5 with extended thinking**: Enable thinking for complex analysis
  without switching models. Good balance of capability and responsiveness.

**OpenAI** (note: o3-pro is forbidden — see Forbidden Models above)
- **openai/o3**: OpenAI's best reasoning model for code review. Good for state
  management analysis, algorithmic correctness, and methodical bug-hunting through
  code paths.
- **openai/gpt-5-pro / openai/gpt-5.2-pro**: OpenAI's flagship non-o-series models with
  reasoning. Good general-purpose deep analysis.
- **openai/o4-mini**: Reasoning model suitable for targeted deep analysis of specific
  files or functions.

**Google**
- **google/gemini-3-pro-preview**: Google's latest and most capable model (1M context).
  Strong at cross-file analysis and understanding large codebases holistically.
- **google/gemini-2.5-pro with thinking**: Excellent at large-context analysis — can
  reason over many files simultaneously with its 1M token context window. Good for
  architectural consistency checks and understanding how changes ripple across a
  large codebase.
- **google/gemini-2.5-flash with thinking**: When you need reasoning over large context
  but want faster response times.

**xAI**
- **xai/grok-4**: xAI's strongest reasoning model. Good for getting a different
  perspective from a different model family on critical findings.
- **xai/grok-4-fast / xai/grok-4-1-fast**: Reasoning models with massive 2M context
  windows. Useful when you need to reason over an extremely large amount of code.
- **xai/grok-code-fast-1**: Code-specialized reasoning model (256k context). Consider
  for code-focused analysis where code understanding is more important than general
  reasoning breadth.

### Bug Finding and Flaw Detection

When the goal is specifically to track down bugs or logical flaws in changes,
these models excel:

- **openai/o3**: The o-series models are particularly strong at systematic
  bug-hunting. They methodically trace execution paths, track state through
  branches, and identify edge cases. Best choice when you suspect there's a bug
  and need to find it.
- **anthropic/claude-opus-4-6**: Excels at understanding developer intent and spotting where
  the implementation diverges from what was likely intended. Good at catching bugs
  that arise from misunderstanding an API or protocol.
- **google/gemini-2.5-pro with thinking**: Strong at finding bugs that manifest across
  file boundaries — where a change in one file breaks an assumption in another.
  The large context window helps hold the full picture.
- **xai/grok-code-fast-1**: Code-specialized model that can be effective for
  language-specific bug patterns.

### Code Generation Models

When a review suggestion includes a concrete code fix or refactor, switching to a
code-specialized model can produce better, more idiomatic suggestions:

- **openai/gpt-5.1-codex / openai/gpt-5.2-codex**: OpenAI's code-specialized models.
  Best choice when generating substantive code suggestions — refactors, rewrites,
  or proposed fixes. These models produce cleaner, more idiomatic code than
  general-purpose models.
- **openai/codex-mini-latest**: A lighter code generation model. Good for smaller,
  targeted code suggestions where speed matters more than handling complex
  multi-file refactors.
- **xai/grok-code-fast-1**: Fast code generation with strong code understanding
  (256k context). Useful when you need quick, code-focused suggestions and want
  to avoid the latency of larger models.

Use code generation models when your review finding warrants a concrete code
example — a suggested fix, a refactored alternative, or an idiomatic replacement.
For findings that are purely analytical (architectural concerns, design feedback),
stick with reasoning or balanced models instead.
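
For instance, a fix-generation subtask might look like this (an illustrative payload; the task text is hypothetical):

```json
{
  "tasks": [
    {
      "task": "Propose a minimal, behavior-preserving fix for the flagged function, as a code suggestion.",
      "model": "openai/gpt-5.1-codex"
    }
  ]
}
```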

Use deep reasoning models for:

- **Architectural analysis**: How changes affect the broader codebase structure,
  dependency patterns, separation of concerns
- **Security review**: Authentication, authorization, injection vulnerabilities,
  cryptographic usage, secrets handling
- **Concurrency and state**: Race conditions, deadlocks, shared mutable state,
  transaction boundaries
- **Complex algorithms**: Mathematical correctness, edge cases in complex logic,
  performance characteristics
- **Systems code**: Rust, C, C++ — memory management, lifetime issues, unsafe blocks

## Language and Framework Considerations

Some languages and domains benefit from specific models:

- **Rust / C / C++**: Memory safety, lifetimes, undefined behavior — use
  `anthropic/claude-opus-4-6` or `openai/o3` for their strong reasoning about resource management.
  `xai/grok-code-fast-1` is also worth considering for Rust-specific patterns.
- **TypeScript / JavaScript / React**: Most models handle these well. `anthropic/claude-sonnet-4-5`
  or `google/gemini-2.5-pro` are strong defaults. For complex state management (Redux,
  hooks, async flows), use reasoning models.
- **Python**: Most models handle it well. For ML/data pipeline code, `google/gemini-2.5-pro`
  with thinking is strong given its deep Python training data.
- **SQL / database migrations**: Schema changes and data integrity — `openai/o3` is
  strong at reasoning about relational constraints and migration ordering.
- **Infrastructure / IaC**: Terraform, CloudFormation, Kubernetes — security
  implications benefit from `anthropic/claude-opus-4-6` or `openai/o3` for their security reasoning.
- **Shell scripts**: Security-sensitive (injection, permissions) — use at least
  `anthropic/claude-sonnet-4-5` with thinking enabled.
- **Ruby / Rails**: `anthropic/claude-opus-4-6` and `openai/gpt-5` have strong Ruby understanding.
  For Rails-specific patterns (N+1 queries, callback chains, ActiveRecord
  pitfalls), reasoning models help trace the implicit execution flow.
- **Go**: Strong support across most models. For concurrency review (goroutines,
  channels, sync primitives), prefer `openai/o3` for its systematic path tracing.

## Subtask Strategy

The `task` tool supports parallel execution with per-task model selection. This is
the key unlock for code review: run multiple review perspectives simultaneously,
each with a model suited to the task.

### Parallel with per-task models

Each task in the `tasks` array can specify its own model:

```json
{
  "tasks": [
    { "task": "Review changed lines for bugs, logic errors, and edge cases.", "model": "openai/o3" },
    { "task": "Analyze security implications of these changes.", "model": "anthropic/claude-opus-4-6" },
    { "task": "Check architectural consistency with the broader codebase.", "model": "google/gemini-2.5-pro" }
  ]
}
```

Tasks without a `model` inherit the top-level `model` parameter, or the current
session model if neither is set.
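
That precedence can be sketched as follows. This is a hypothetical helper for illustration; the real `task` tool resolves this internally:

```javascript
// Documented precedence: per-task model, then top-level model, then session model.
// `effectiveModel` is an illustrative name, not part of the task tool's API.
function effectiveModel(taskEntry, topLevelModel, sessionModel) {
  return taskEntry.model ?? topLevelModel ?? sessionModel;
}

const request = {
  model: "anthropic/claude-sonnet-4-5",
  tasks: [
    { task: "Security review.", model: "anthropic/claude-opus-4-6" },
    { task: "Check test coverage." }, // no model: inherits the top-level one
  ],
};
const resolved = request.tasks.map(
  (t) => effectiveModel(t, request.model, "openai/gpt-5")
);
// resolved: ["anthropic/claude-opus-4-6", "anthropic/claude-sonnet-4-5"]
```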

### Parallel with shared model

When all subtasks can use the same model, you can set a single top-level model:

```json
{
  "tasks": [
    { "task": "Review the changed lines in isolation for bugs and issues." },
    { "task": "Read the full files and check consistency with existing patterns." },
    { "task": "Check test coverage for the changed code." }
  ],
  "model": "anthropic/claude-sonnet-4-5"
}
```

### Switching your own model

You don't always need subtasks to use a different model. You can switch your own
model mid-review using `/model` and continue working directly. This is useful when
you need specific expertise, or when you want to bring deeper reasoning to a
specific part of your analysis without the overhead of spawning a subtask.

## When NOT to Use Subtasks

Before reaching for the `task` tool, ask: "Can I do this with `read`, `bash`, or
other built-in tools directly?" If yes, do it directly. Subtasks are for
**multi-step, context-heavy work** — not for simple operations.

**Never use subtasks for:**

- **Reading files**: Use the `read` tool directly. A subtask spawns a full `pi`
  process just to call `read` — adding seconds of overhead and failure risk for
  something that takes milliseconds.
- **Running basic commands**: `bash` with `git diff`, `rg`, `find`, etc. is
  instant. Don't wrap these in subtasks.
- **Gathering context before review**: Read the files you need, run the commands
  you need, then do your analysis. This is normal tool use, not subtask work.
- **Any single-tool operation**: If the task boils down to one `read` or `bash`
  call, it doesn't need a subtask.

**Do use subtasks for:**

- Running **multiple independent review analyses in parallel**, each requiring
  many tool calls and producing substantial output
- Work that would **consume significant context** in the parent session (e.g.,
  reading and analyzing 20+ files)
- Getting a **different model's perspective** on complex findings

**Anti-pattern to avoid:** Don't dispatch 5 parallel subtasks to read 5 files.
Instead, read the 5 files yourself with 5 `read` calls (which can't fail due to
process spawn issues), then use subtasks only if you need parallel *analysis* of
the content.

## When NOT to Switch Models

- If the user has explicitly requested a specific model, respect that choice
- If the diff is very small (under ~100 lines total), model switching adds
  overhead without meaningful benefit — a single balanced model handles it fine
- Don't switch to a weaker/faster model for trivial operations — if the operation
  is trivial enough for a weaker model, it's trivial enough to do directly without
  a subtask at all
- Don't use model overrides as a default — only specify a model when you have a
  clear reason that a *different* model would produce better results for that
  specific subtask
@@ -0,0 +1,144 @@
---
name: review-roulette
description: Dispatch a review task to 3 randomly-selected reasoning models in parallel for diverse perspectives, then merge all suggestions into a single result.
---

<!-- SPDX-License-Identifier: GPL-3.0-or-later -->

# Review Roulette

When this skill is active, your ONLY job is orchestration — you do NOT perform
any review analysis yourself. You randomly select 3 reasoning models, dispatch
the review to all of them in parallel, and merge the results.

## Step 1: Discover Available Reasoning Models

Run `${PI_CMD:-pi} --list-models` via bash to get the current list of models with
valid API keys. **Eligible models** are those that show `thinking: yes` in the
output — these are the reasoning-capable / premium models.

**Excluded models**: Never select `openai/o3-pro` — it is prohibitively
expensive. If it appears in the model list, skip it entirely.

Example reasoning models you might see (provider/model format):

- `anthropic/claude-opus-4-6`
- `anthropic/claude-sonnet-4-5` (with thinking)
- `openai/o3`
- `openai/o4-mini`
- `openai/gpt-5-pro`
- `openai/gpt-5.2-pro`
- `google/gemini-2.5-pro` (with thinking)
- `google/gemini-2.5-flash` (with thinking)
- `google/gemini-3-pro-preview`
- `xai/grok-4`

The exact list depends on which API keys are configured. Always check — do not
assume models are available.
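
The eligibility rule can be sketched as a filter. This is illustrative only: the exact shape of the `--list-models` output is an assumption here, so adapt the parsing to what the command actually prints.

```javascript
// Assumed shape of one parsed `--list-models` row (an assumption, not the real format):
// { provider: "openai", model: "o3", thinking: true }
function eligibleModels(rows) {
  return rows
    .filter((r) => r.thinking) // reasoning-capable models only
    .map((r) => `${r.provider}/${r.model}`)
    .filter((id) => id !== "openai/o3-pro"); // excluded: prohibitively expensive
}
```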

## Step 2: Randomly Select 3 Models

From the eligible reasoning models, pick **exactly 3** at random.

**CRITICAL — true randomness and diversity:**

- Do NOT always pick the same 3 models. The entire point of review roulette is
  variety of perspectives across runs.
- **Prefer different providers** when possible. If you have reasoning models from
  Anthropic, OpenAI, Google, and xAI, pick from 3 different providers. Only
  double up on a provider if fewer than 3 providers have eligible models.
- Shuffle or randomize your selection each time. Do not default to alphabetical
  order or any fixed preference.
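
One way to implement those rules is a shuffle followed by a greedy pass that prefers unseen providers. The helper name is hypothetical; any selection with these properties would do:

```javascript
// Fisher-Yates shuffle, then greedily take one model per provider;
// top up with repeats only when fewer than 3 providers are available.
function pickThree(models, random = Math.random) {
  const pool = [...models];
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  const picks = [];
  const providers = new Set();
  for (const id of pool) {
    const provider = id.split("/")[0];
    if (!providers.has(provider)) {
      providers.add(provider);
      picks.push(id);
    }
    if (picks.length === 3) return picks;
  }
  for (const id of pool) { // fewer than 3 providers: allow provider repeats
    if (picks.length === 3) break;
    if (!picks.includes(id)) picks.push(id);
  }
  return picks;
}
```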

## Step 3: Dispatch the Review in Parallel

Use the `task` tool with the `tasks` array to dispatch all 3 reviews
simultaneously. Each task object must include:

1. **`model`**: The selected model in `provider/model` format.
2. **`task`**: The **FULL original review prompt/instructions**. Each subtask
   starts fresh with NO conversation history and NO context from the parent.
   You must forward EVERYTHING you were asked to do — the complete prompt, all
   instructions, the diff, file contents, any constraints or formatting
   requirements, the expected JSON output schema, etc. Do not summarize or
   abbreviate. Pass it all through verbatim.

Example structure:

```json
{
  "tasks": [
    {
      "task": "<the ENTIRE original review prompt and instructions>",
      "model": "anthropic/claude-opus-4-6"
    },
    {
      "task": "<the ENTIRE original review prompt and instructions>",
      "model": "openai/o3"
    },
    {
      "task": "<the ENTIRE original review prompt and instructions>",
      "model": "google/gemini-2.5-pro"
    }
  ]
}
```

## Step 4: Merge Results

Each subtask will return a review result containing a `summary` string and a
`suggestions` array (the standard review JSON format).

Collect the results from all 3 subtask responses and merge them:

- **Suggestions**: Concatenate all `suggestions` arrays into a single array.
- **Summary**: Concatenate all summaries with model attribution. Format the
  merged summary as:

  ```
  <provider/model1>:
  <summary1>

  <provider/model2>:
  <summary2>

  <provider/model3>:
  <summary3>
  ```

  This attributed format also serves as a record of which models were used in
  the review.

Return the merged result as the final JSON response.

**Do NOT:**

- Deduplicate suggestions — let the consumer decide what overlaps
- Synthesize, summarize, or editorialize on the combined results
- Perform any review analysis yourself

**Do:**

- Concatenate all suggestion arrays: `[...model1, ...model2, ...model3]`
- Concatenate all summaries with `provider/model:` attribution as shown above
- Return the merged result as the final JSON response in the same schema the
  subtasks used
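
The two merge rules can be sketched as a single helper, assuming the parent pairs each subtask result with the model it dispatched (the `model` field below is that pairing; the helper name is hypothetical):

```javascript
// Merge per the rules above: concatenate suggestions untouched (no deduplication),
// and join summaries with `provider/model:` attribution.
function mergeReviews(results) {
  return {
    summary: results.map((r) => `${r.model}:\n${r.summary}`).join("\n\n"),
    suggestions: results.flatMap((r) => r.suggestions),
  };
}
```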

## Summary

```
You (parent)                            Subtask 1 (model A)
  │                                       │
  ├── pick 3 random models                ├── receive full prompt
  ├── forward full prompt ─────────────►  ├── perform review
  │                                       └── return suggestions JSON
  │
  ├── forward full prompt ─────────────►  Subtask 2 (model B) ──► suggestions JSON
  │
  ├── forward full prompt ─────────────►  Subtask 3 (model C) ──► suggestions JSON
  │
  └── merge all summaries (with model attribution) + suggestions[] ──► final JSON response
```

The parent does zero analysis. It is purely a dispatcher and merger.
Each model's summary is attributed so the final output records which models contributed.
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@in-the-loop-labs/pair-review",
-  "version": "2.0.2",
+  "version": "2.0.3",
   "description": "Your AI-powered code review partner - Close the feedback loop with AI coding agents",
   "main": "src/server.js",
   "bin": {
@@ -8,6 +8,7 @@
     "git-diff-lines": "bin/git-diff-lines"
   },
   "files": [
+    ".pi/",
     "bin/",
     "src/",
     "public/",
@@ -1,6 +1,6 @@
 {
   "name": "pair-review",
-  "version": "2.0.2",
+  "version": "2.0.3",
   "description": "pair-review app integration — Open PRs and local changes in the pair-review web UI, run server-side AI analysis, and address review feedback. Requires the pair-review MCP server.",
   "author": {
     "name": "in-the-loop-labs",
@@ -1,6 +1,6 @@
 {
   "name": "code-critic",
-  "version": "2.0.2",
+  "version": "2.0.3",
   "description": "AI-powered code review analysis — Run three-level AI analysis and implement-review-fix loops directly in your coding agent. Works standalone, no server required.",
   "author": {
     "name": "in-the-loop-labs",