@in-the-loop-labs/pair-review 2.0.2 → 2.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,260 @@
---
name: review-model-guidance
description: Guidance for selecting models when performing code review with subtasks. Load this skill to enable intelligent model selection for review analysis — choosing faster models for simple tasks and deeper reasoning models for complex analysis.
---

# Review Model Guidance

When performing code review analysis, you can switch models at any time to match
the demands of the task. This applies both to your own direct work and to subtasks
you delegate via the `task` tool. Use `/model` to switch your own model, or pass
the `model` parameter when delegating subtasks.

Before switching models, check which models are actually available by running
`${PI_CMD:-pi} --list-models` via bash. This lists every model with a valid API
key configured, in provider and model columns. Specify models as
`provider/model`. Only switch to models that appear in that list.

## Forbidden Models

The following models are **never** permitted due to extreme cost:

- **openai/o3-pro** — Prohibitively expensive. Use `openai/o3` instead, which
  provides strong reasoning at a fraction of the cost.

Do not select these models for your own work, for subtasks, or in any model
parameter. If a forbidden model is the only reasoning model available, fall back
to a balanced model with extended thinking enabled instead.
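
The fallback logic above can be sketched as a small guard. This is an illustrative helper, not part of the `task` tool; the names are hypothetical, while the forbidden list, the `openai/o3` substitution, and the balanced fallback all come from this section:

```javascript
// Illustrative guard for the rules above; function and constant names are hypothetical.
const FORBIDDEN = { "openai/o3-pro": "openai/o3" }; // forbidden model -> documented substitute
const BALANCED_FALLBACK = "anthropic/claude-sonnet-4-5"; // use with extended thinking

function enforceModelPolicy(requested, availableModels) {
  // Swap a forbidden model for its cheaper documented substitute.
  const candidate = FORBIDDEN[requested] ?? requested;
  // Only switch to models that actually appear in `--list-models` output.
  return availableModels.includes(candidate) ? candidate : BALANCED_FALLBACK;
}
```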

## When to Switch Models

Not every task benefits from model switching. Use these heuristics:

- **Same model is fine** when you're doing a single focused task, or when the diff
  is small enough that model choice won't materially affect quality.
- **Switch models** when you're delegating subtasks with clearly different
  complexity levels, when you're reviewing different aspects of the changes or
  reviewing from different perspectives, or when you want a second opinion from a
  different model family on critical findings.

## Model Selection by Task Type

### Balanced Models

Use mid-tier models for general code review work:

- **File-context analysis**: Understanding how changes fit within a file's
  existing patterns, checking for consistency with surrounding code
- **API contract review**: Verifying that function signatures, types, and
  interfaces are used correctly
- **Test coverage assessment**: Evaluating whether test changes match code changes
- **Most general code review**: The default choice when you don't have a strong
  reason to go deeper
- **Small to medium diffs**: Where the review itself is straightforward

Good balanced choices: `anthropic/claude-sonnet-4-5`, `google/gemini-2.5-pro`, `openai/gpt-5`

### Deep Reasoning Models

For complex analysis, enable extended thinking or switch to a reasoning model.
Here are the higher-end models and their strengths for code review:

**Anthropic**
- **anthropic/claude-opus-4-6**: Anthropic's most capable model. Strongest at nuanced
  architectural reasoning, understanding implicit design patterns, and catching
  subtle issues that require deep understanding of intent. Excellent at security
  review and explaining *why* something is problematic, not just *that* it is.
- **anthropic/claude-opus-4-5**: Previous-generation flagship. Still very strong for complex
  review tasks. Good alternative when opus-4-6 is unavailable.
- **anthropic/claude-sonnet-4-5 with extended thinking**: Enable thinking for complex analysis
  without switching models. Good balance of capability and responsiveness.

**OpenAI** (note: o3-pro is forbidden — see Forbidden Models above)
- **openai/o3**: OpenAI's best reasoning model for code review. Good for state
  management analysis, algorithmic correctness, and methodical bug-hunting through
  code paths.
- **openai/gpt-5-pro / openai/gpt-5.2-pro**: OpenAI's flagship non-o-series models with
  reasoning. Good general-purpose deep analysis.
- **openai/o4-mini**: Reasoning model suitable for targeted deep analysis of specific
  files or functions.

**Google**
- **google/gemini-3-pro-preview**: Google's latest and most capable model (1M context).
  Strong at cross-file analysis and understanding large codebases holistically.
- **google/gemini-2.5-pro with thinking**: Excellent at large-context analysis — can
  reason over many files simultaneously with its 1M token context window. Good for
  architectural consistency checks and understanding how changes ripple across a
  large codebase.
- **google/gemini-2.5-flash with thinking**: When you need reasoning over large context
  but want faster response times.

**xAI**
- **xai/grok-4**: xAI's strongest reasoning model. Good for getting a different
  perspective from a different model family on critical findings.
- **xai/grok-4-fast / xai/grok-4-1-fast**: Reasoning models with massive 2M context
  windows. Useful when you need to reason over an extremely large amount of code.
- **xai/grok-code-fast-1**: Code-specialized reasoning model (256k context). Consider
  for code-focused analysis where code understanding is more important than general
  reasoning breadth.

### Bug Finding and Flaw Detection

When the goal is specifically to track down bugs or logical flaws in changes,
these models excel:

- **openai/o3**: The o-series models are particularly strong at systematic
  bug-hunting. They methodically trace execution paths, track state through
  branches, and identify edge cases. Best choice when you suspect there's a bug
  and need to find it.
- **anthropic/claude-opus-4-6**: Excels at understanding developer intent and spotting where
  the implementation diverges from what was likely intended. Good at catching bugs
  that arise from misunderstanding an API or protocol.
- **google/gemini-2.5-pro with thinking**: Strong at finding bugs that manifest across
  file boundaries — where a change in one file breaks an assumption in another.
  The large context window helps hold the full picture.
- **xai/grok-code-fast-1**: Code-specialized model that can be effective for
  language-specific bug patterns.

### Code Generation Models

When a review suggestion includes a concrete code fix or refactor, switching to a
code-specialized model can produce better, more idiomatic suggestions:

- **openai/gpt-5.1-codex / openai/gpt-5.2-codex**: OpenAI's code-specialized models.
  Best choice when generating substantive code suggestions — refactors, rewrites,
  or proposed fixes. These models produce cleaner, more idiomatic code than
  general-purpose models.
- **openai/codex-mini-latest**: A lighter code generation model. Good for smaller,
  targeted code suggestions where speed matters more than handling complex
  multi-file refactors.
- **xai/grok-code-fast-1**: Fast code generation with strong code understanding
  (256k context). Useful when you need quick, code-focused suggestions and want
  to avoid the latency of larger models.

Use code generation models when your review finding warrants a concrete code
example — a suggested fix, a refactored alternative, or an idiomatic replacement.
For findings that are purely analytical (architectural concerns, design feedback),
stick with reasoning or balanced models instead.
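
For instance, a fix-generation subtask might look like this (an illustrative payload; the task text is hypothetical):

```json
{
  "tasks": [
    {
      "task": "Propose a minimal, behavior-preserving fix for the flagged function, as a code suggestion.",
      "model": "openai/gpt-5.1-codex"
    }
  ]
}
```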

Use deep reasoning models for:

- **Architectural analysis**: How changes affect the broader codebase structure,
  dependency patterns, separation of concerns
- **Security review**: Authentication, authorization, injection vulnerabilities,
  cryptographic usage, secrets handling
- **Concurrency and state**: Race conditions, deadlocks, shared mutable state,
  transaction boundaries
- **Complex algorithms**: Mathematical correctness, edge cases in complex logic,
  performance characteristics
- **Systems code**: Rust, C, C++ — memory management, lifetime issues, unsafe blocks

## Language and Framework Considerations

Some languages and domains benefit from specific models:

- **Rust / C / C++**: Memory safety, lifetimes, undefined behavior — use
  `anthropic/claude-opus-4-6` or `openai/o3` for their strong reasoning about resource management.
  `xai/grok-code-fast-1` is also worth considering for Rust-specific patterns.
- **TypeScript / JavaScript / React**: Most models handle these well. `anthropic/claude-sonnet-4-5`
  or `google/gemini-2.5-pro` are strong defaults. For complex state management (Redux,
  hooks, async flows), use reasoning models.
- **Python**: Most models handle it well. For ML/data pipeline code, `google/gemini-2.5-pro`
  with thinking is strong given its deep Python training data.
- **SQL / database migrations**: Schema changes and data integrity — `openai/o3` is
  strong at reasoning about relational constraints and migration ordering.
- **Infrastructure / IaC**: Terraform, CloudFormation, Kubernetes — security
  implications benefit from `anthropic/claude-opus-4-6` or `openai/o3` for their security reasoning.
- **Shell scripts**: Security-sensitive (injection, permissions) — use at least
  `anthropic/claude-sonnet-4-5` with thinking enabled.
- **Ruby / Rails**: `anthropic/claude-opus-4-6` and `openai/gpt-5` have strong Ruby understanding.
  For Rails-specific patterns (N+1 queries, callback chains, ActiveRecord
  pitfalls), reasoning models help trace the implicit execution flow.
- **Go**: Strong support across most models. For concurrency review (goroutines,
  channels, sync primitives), prefer `openai/o3` for its systematic path tracing.

## Subtask Strategy

The `task` tool supports parallel execution with per-task model selection. This is
the key unlock for code review: run multiple review perspectives simultaneously,
each with a model suited to the task.

### Parallel with per-task models

Each task in the `tasks` array can specify its own model:

```json
{
  "tasks": [
    { "task": "Review changed lines for bugs, logic errors, and edge cases.", "model": "openai/o3" },
    { "task": "Analyze security implications of these changes.", "model": "anthropic/claude-opus-4-6" },
    { "task": "Check architectural consistency with the broader codebase.", "model": "google/gemini-2.5-pro" }
  ]
}
```

Tasks without a `model` inherit the top-level `model` parameter, or the current
session model if neither is set.
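
That precedence can be sketched as follows. This is a hypothetical helper for illustration; the real `task` tool resolves this internally:

```javascript
// Documented precedence: per-task model, then top-level model, then session model.
// `effectiveModel` is an illustrative name, not part of the task tool's API.
function effectiveModel(taskEntry, topLevelModel, sessionModel) {
  return taskEntry.model ?? topLevelModel ?? sessionModel;
}

const request = {
  model: "anthropic/claude-sonnet-4-5",
  tasks: [
    { task: "Security review.", model: "anthropic/claude-opus-4-6" },
    { task: "Check test coverage." }, // no model: inherits the top-level one
  ],
};
const resolved = request.tasks.map(
  (t) => effectiveModel(t, request.model, "openai/gpt-5")
);
// resolved: ["anthropic/claude-opus-4-6", "anthropic/claude-sonnet-4-5"]
```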

### Parallel with shared model

When all subtasks can use the same model, you can set a single top-level model:

```json
{
  "tasks": [
    { "task": "Review the changed lines in isolation for bugs and issues." },
    { "task": "Read the full files and check consistency with existing patterns." },
    { "task": "Check test coverage for the changed code." }
  ],
  "model": "anthropic/claude-sonnet-4-5"
}
```

### Switching your own model

You don't always need subtasks to use a different model. You can switch your own
model mid-review using `/model` and continue working directly. This is useful when
you need specific expertise, or when you want to bring deeper reasoning to a
specific part of your analysis without the overhead of spawning a subtask.

## When NOT to Use Subtasks

Before reaching for the `task` tool, ask: "Can I do this with `read`, `bash`, or
other built-in tools directly?" If yes, do it directly. Subtasks are for
**multi-step, context-heavy work** — not for simple operations.

**Never use subtasks for:**

- **Reading files**: Use the `read` tool directly. A subtask spawns a full `pi`
  process just to call `read` — adding seconds of overhead and failure risk for
  something that takes milliseconds.
- **Running basic commands**: `bash` with `git diff`, `rg`, `find`, etc. is
  instant. Don't wrap these in subtasks.
- **Gathering context before review**: Read the files you need, run the commands
  you need, then do your analysis. This is normal tool use, not subtask work.
- **Any single-tool operation**: If the task boils down to one `read` or `bash`
  call, it doesn't need a subtask.

**Do use subtasks for:**

- Running **multiple independent review analyses in parallel**, each requiring
  many tool calls and producing substantial output
- Work that would **consume significant context** in the parent session (e.g.,
  reading and analyzing 20+ files)
- Getting a **different model's perspective** on complex findings

**Anti-pattern to avoid:** Don't dispatch 5 parallel subtasks to read 5 files.
Instead, read the 5 files yourself with 5 `read` calls (which can't fail due to
process spawn issues), then use subtasks only if you need parallel *analysis* of
the content.

## When NOT to Switch Models

- If the user has explicitly requested a specific model, respect that choice
- If the diff is very small (under ~100 lines total), model switching adds
  overhead without meaningful benefit — a single balanced model handles it fine
- Don't switch to a weaker/faster model for trivial operations — if the operation
  is trivial enough for a weaker model, it's trivial enough to do directly without
  a subtask at all
- Don't use model overrides as a default — only specify a model when you have a
  clear reason that a *different* model would produce better results for that
  specific subtask
@@ -0,0 +1,144 @@
---
name: review-roulette
description: Dispatch a review task to 3 randomly-selected reasoning models in parallel for diverse perspectives, then merge all suggestions into a single result.
---

<!-- SPDX-License-Identifier: GPL-3.0-or-later -->

# Review Roulette

When this skill is active, your ONLY job is orchestration — you do NOT perform
any review analysis yourself. You randomly select 3 reasoning models, dispatch
the review to all of them in parallel, and merge the results.

## Step 1: Discover Available Reasoning Models

Run `${PI_CMD:-pi} --list-models` via bash to get the current list of models with
valid API keys. **Eligible models** are those that show `thinking: yes` in the
output — these are the reasoning-capable / premium models.

**Excluded models**: Never select `openai/o3-pro` — it is prohibitively
expensive. If it appears in the model list, skip it entirely.

Example reasoning models you might see (provider/model format):

- `anthropic/claude-opus-4-6`
- `anthropic/claude-sonnet-4-5` (with thinking)
- `openai/o3`
- `openai/o4-mini`
- `openai/gpt-5-pro`
- `openai/gpt-5.2-pro`
- `google/gemini-2.5-pro` (with thinking)
- `google/gemini-2.5-flash` (with thinking)
- `google/gemini-3-pro-preview`
- `xai/grok-4`

The exact list depends on which API keys are configured. Always check — do not
assume models are available.
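
The eligibility rule can be sketched as a filter. This is illustrative only: the exact shape of the `--list-models` output is an assumption here, so adapt the parsing to what the command actually prints.

```javascript
// Assumed shape of one parsed `--list-models` row (an assumption, not the real format):
// { provider: "openai", model: "o3", thinking: true }
function eligibleModels(rows) {
  return rows
    .filter((r) => r.thinking) // reasoning-capable models only
    .map((r) => `${r.provider}/${r.model}`)
    .filter((id) => id !== "openai/o3-pro"); // excluded: prohibitively expensive
}
```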

## Step 2: Randomly Select 3 Models

From the eligible reasoning models, pick **exactly 3** at random.

**CRITICAL — true randomness and diversity:**

- Do NOT always pick the same 3 models. The entire point of review roulette is
  variety of perspectives across runs.
- **Prefer different providers** when possible. If you have reasoning models from
  Anthropic, OpenAI, Google, and xAI, pick from 3 different providers. Only
  double up on a provider if fewer than 3 providers have eligible models.
- Shuffle or randomize your selection each time. Do not default to alphabetical
  order or any fixed preference.
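
One way to implement those rules is a shuffle followed by a greedy pass that prefers unseen providers. The helper name is hypothetical; any selection with these properties would do:

```javascript
// Fisher-Yates shuffle, then greedily take one model per provider;
// top up with repeats only when fewer than 3 providers are available.
function pickThree(models, random = Math.random) {
  const pool = [...models];
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  const picks = [];
  const providers = new Set();
  for (const id of pool) {
    const provider = id.split("/")[0];
    if (!providers.has(provider)) {
      providers.add(provider);
      picks.push(id);
    }
    if (picks.length === 3) return picks;
  }
  for (const id of pool) { // fewer than 3 providers: allow provider repeats
    if (picks.length === 3) break;
    if (!picks.includes(id)) picks.push(id);
  }
  return picks;
}
```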

## Step 3: Dispatch the Review in Parallel

Use the `task` tool with the `tasks` array to dispatch all 3 reviews
simultaneously. Each task object must include:

1. **`model`**: The selected model in `provider/model` format.
2. **`task`**: The **FULL original review prompt/instructions**. Each subtask
   starts fresh with NO conversation history and NO context from the parent.
   You must forward EVERYTHING you were asked to do — the complete prompt, all
   instructions, the diff, file contents, any constraints or formatting
   requirements, the expected JSON output schema, etc. Do not summarize or
   abbreviate. Pass it all through verbatim.

Example structure:

```json
{
  "tasks": [
    {
      "task": "<the ENTIRE original review prompt and instructions>",
      "model": "anthropic/claude-opus-4-6"
    },
    {
      "task": "<the ENTIRE original review prompt and instructions>",
      "model": "openai/o3"
    },
    {
      "task": "<the ENTIRE original review prompt and instructions>",
      "model": "google/gemini-2.5-pro"
    }
  ]
}
```

## Step 4: Merge Results

Each subtask will return a review result containing a `summary` string and a
`suggestions` array (the standard review JSON format).

Collect the results from all 3 subtask responses and merge them:

- **Suggestions**: Concatenate all `suggestions` arrays into a single array.
- **Summary**: Concatenate all summaries with model attribution. Format the
  merged summary as:

  ```
  <provider/model1>:
  <summary1>

  <provider/model2>:
  <summary2>

  <provider/model3>:
  <summary3>
  ```

  This attributed format also serves as a record of which models were used in
  the review.

Return the merged result as the final JSON response.

**Do NOT:**

- Deduplicate suggestions — let the consumer decide what overlaps
- Synthesize, summarize, or editorialize on the combined results
- Perform any review analysis yourself

**Do:**

- Concatenate all suggestion arrays: `[...model1, ...model2, ...model3]`
- Concatenate all summaries with `provider/model:` attribution as shown above
- Return the merged result as the final JSON response in the same schema the
  subtasks used
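
The two merge rules can be sketched as a single helper, assuming the parent pairs each subtask result with the model it dispatched (the `model` field below is that pairing; the helper name is hypothetical):

```javascript
// Merge per the rules above: concatenate suggestions untouched (no deduplication),
// and join summaries with `provider/model:` attribution.
function mergeReviews(results) {
  return {
    summary: results.map((r) => `${r.model}:\n${r.summary}`).join("\n\n"),
    suggestions: results.flatMap((r) => r.suggestions),
  };
}
```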

## Summary

```
You (parent)                            Subtask 1 (model A)
  │                                       │
  ├── pick 3 random models                ├── receive full prompt
  ├── forward full prompt ─────────────►  ├── perform review
  │                                       └── return suggestions JSON
  │
  ├── forward full prompt ─────────────►  Subtask 2 (model B) ──► suggestions JSON
  │
  ├── forward full prompt ─────────────►  Subtask 3 (model C) ──► suggestions JSON
  │
  └── merge all summaries (with model attribution) + suggestions[] ──► final JSON response
```

The parent does zero analysis. It is purely a dispatcher and merger.
Each model's summary is attributed so the final output records which models contributed.
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@in-the-loop-labs/pair-review",
-  "version": "2.0.2",
+  "version": "2.0.3",
   "description": "Your AI-powered code review partner - Close the feedback loop with AI coding agents",
   "main": "src/server.js",
   "bin": {
@@ -8,6 +8,7 @@
     "git-diff-lines": "bin/git-diff-lines"
   },
   "files": [
+    ".pi/",
     "bin/",
     "src/",
     "public/",
@@ -1,6 +1,6 @@
 {
   "name": "pair-review",
-  "version": "2.0.2",
+  "version": "2.0.3",
   "description": "pair-review app integration — Open PRs and local changes in the pair-review web UI, run server-side AI analysis, and address review feedback. Requires the pair-review MCP server.",
   "author": {
     "name": "in-the-loop-labs",
@@ -1,6 +1,6 @@
 {
   "name": "code-critic",
-  "version": "2.0.2",
+  "version": "2.0.3",
   "description": "AI-powered code review analysis — Run three-level AI analysis and implement-review-fix loops directly in your coding agent. Works standalone, no server required.",
   "author": {
     "name": "in-the-loop-labs",