@qwen-code/qwen-code 0.15.6 → 0.15.7-preview.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -481,6 +481,69 @@ When using a raw model via `--model gpt-4` (not from modelProviders, creates a R
481
481
 
482
482
  The merge strategy for `modelProviders` itself is REPLACE: the entire `modelProviders` from project settings will override the corresponding section in user settings, rather than merging the two.
483
483
 
484
+ ## Reasoning / thinking configuration
485
+
486
+ The optional `reasoning` field under `generationConfig` controls how aggressively the model reasons before responding. The Anthropic and Gemini converters always honor it. The OpenAI-compatible pipeline honors it **unless** `generationConfig.samplingParams` is set — see the "Interaction with `samplingParams`" caveat below.
487
+
488
+ ```jsonc
489
+ {
490
+ "modelProviders": {
491
+ "openai": [
492
+ {
493
+ "id": "deepseek-v4-pro",
494
+ "name": "DeepSeek V4 Pro",
495
+ "baseUrl": "https://api.deepseek.com/v1",
496
+ "envKey": "DEEPSEEK_API_KEY",
497
+ "generationConfig": {
498
+ // The four-tier scale:
499
+ // 'low' | 'medium' — server-mapped to 'high' on DeepSeek
500
+ // 'high' — default reasoning intensity
501
+ // 'max' — DeepSeek-specific extra-strong tier
502
+ // Or set `false` to disable reasoning entirely.
503
+ "reasoning": { "effort": "max" },
504
+ },
505
+ },
506
+ ],
507
+ },
508
+ }
509
+ ```
510
+
511
+ ### Per-provider behavior
512
+
513
+ | Protocol / provider | Wire shape | Notes |
514
+ | -------------------------------------------- | -------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
515
+ | **OpenAI / DeepSeek** (`api.deepseek.com`) | Flat `reasoning_effort: <effort>` body parameter | When `reasoning.effort` is set in the nested config shape, it's rewritten to flat `reasoning_effort` and `'low'`/`'medium'` are normalized to `'high'`, `'xhigh'` to `'max'` — mirroring DeepSeek's [server-side back-compat](https://api-docs.deepseek.com/zh-cn/api/create-chat-completion). Top-level `samplingParams.reasoning_effort` or `extra_body.reasoning_effort` overrides skip this normalization and ship verbatim. |
516
+ | **OpenAI** (other compatible servers) | `reasoning: { effort, ... }` passed through verbatim | Set via `samplingParams` (e.g. `samplingParams.reasoning_effort` for GPT-5/o-series) when the provider expects a different shape. |
517
+ | **Anthropic** (real `api.anthropic.com`) | `output_config: { effort }` plus the `effort-2025-11-24` beta header | Real Anthropic accepts `'low'`/`'medium'`/`'high'` only. `'max'` is **clamped to `'high'`** with a `debugLogger.warn` line (once per generator); if you want max effort, switch the baseURL to a DeepSeek-compatible endpoint that supports it. |
518
+ | **Anthropic** (`api.deepseek.com/anthropic`) | Same `output_config: { effort }` + beta header | `'max'` is passed through unchanged. |
519
+ | **Gemini** (`@google/genai`) | `thinkingConfig: { includeThoughts: true, thinkingLevel }` | `'low'` → `LOW`, `'high'`/`'max'` → `HIGH`, others → `THINKING_LEVEL_UNSPECIFIED` (Gemini has no `MAX` tier). |
520
+
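+ As a concrete illustration of the first row, the DeepSeek effort normalization could be sketched as a small pure function (illustrative only — not the actual converter code):
+
+ ```ts
+ // Nested `reasoning.effort` is rewritten to the flat `reasoning_effort` body parameter
+ // on api.deepseek.com; tier names are normalized as documented in the table above.
+ function toDeepSeekReasoningEffort(effort: string): string {
+   if (effort === 'low' || effort === 'medium') return 'high'; // both lower tiers map to 'high'
+   if (effort === 'xhigh') return 'max';                       // legacy alias normalized to 'max'
+   return effort;                                              // 'high' / 'max' ship unchanged
+ }
+ ```
+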
521
+ ### `reasoning: false`
522
+
523
+ Setting `reasoning: false` (the literal boolean) explicitly disables thinking on every provider — useful for cheap side queries that don't benefit from reasoning. This is honored at the request level too via `request.config.thinkingConfig.includeThoughts: false` for one-off calls (e.g. suggestion generation).
524
+
525
+ On an `api.deepseek.com` baseURL, the OpenAI pipeline emits the explicit `thinking: { type: 'disabled' }` field that DeepSeek V4+ requires — the server-side default is `'enabled'`, so simply omitting `reasoning_effort` would still incur thinking latency/cost. Self-hosted DeepSeek backends (sglang/vllm) and other OpenAI-compatible servers do **not** receive this field; if you need to disable thinking on those, inject `thinking: { type: 'disabled' }` (or whatever knob your inference framework exposes) via `samplingParams`/`extra_body`.
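+
+ A hedged sketch of that manual injection for a self-hosted backend, written as a TypeScript-style object literal (in `settings.json` this is plain JSON, and the exact field name depends on your inference framework):
+
+ ```ts
+ // Sketch only: self-hosted DeepSeek-style backends don't get the automatic disable field,
+ // so forward the knob yourself via extra_body (field name assumed — verify for sglang/vllm).
+ const generationConfig = {
+   reasoning: false,                                // honored by Anthropic / Gemini / api.deepseek.com
+   extra_body: { thinking: { type: 'disabled' } },  // shipped verbatim to the OpenAI-compatible server
+ };
+ ```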
526
+
527
+ ### Interaction with `samplingParams` (OpenAI-compatible only)
528
+
529
+ > [!warning]
530
+ >
531
+ > When `generationConfig.samplingParams` is set on an OpenAI-compatible provider, the pipeline ships those keys to the wire **verbatim** and skips the separate `reasoning` injection entirely. So a config like `{ samplingParams: { temperature: 0.5 }, reasoning: { effort: 'max' } }` will silently drop the reasoning field on OpenAI/DeepSeek requests.
532
+ >
533
+ > If you set `samplingParams`, include the reasoning knob inside it directly — for DeepSeek that's the flat `samplingParams.reasoning_effort`; GPT-5/o-series models accept the same flat field or the nested `samplingParams.reasoning` object. For OpenRouter and other providers the field name varies; consult the provider docs.
534
+ >
535
+ > The Anthropic and Gemini converters are unaffected — they always read `reasoning.effort` directly regardless of `samplingParams`.
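+
+ A hedged sketch of the corrected shape described in the warning, as a TypeScript-style object literal (plain JSON in `settings.json`; the exact field name is provider-specific):
+
+ ```ts
+ // Sketch only: when samplingParams is present, put the reasoning knob inside it.
+ const generationConfig = {
+   samplingParams: {
+     temperature: 0.5,
+     reasoning_effort: 'max', // DeepSeek / GPT-5 flat field — shipped verbatim
+   },
+   // A separate `reasoning: { effort: 'max' }` here would be silently dropped
+   // on OpenAI-compatible requests, per the warning above.
+ };
+ ```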
536
+
537
+ ### `budget_tokens`
538
+
539
+ You can pin an exact thinking-token budget by including `budget_tokens` alongside `effort`:
540
+
541
+ ```jsonc
542
+ "reasoning": { "effort": "high", "budget_tokens": 50000 }
543
+ ```
544
+
545
+ For Anthropic this becomes `thinking.budget_tokens`. For OpenAI/DeepSeek the field is preserved but currently ignored by the server — `reasoning_effort` is the load-bearing knob.
546
+
484
547
  ## Provider Models vs Runtime Models
485
548
 
486
549
  Qwen Code distinguishes between two types of model configurations:
@@ -73,7 +73,13 @@ When both legacy settings are present with different values, the migration follo
73
73
 
74
74
  ### Available settings in `settings.json`
75
75
 
76
- Settings are organized into categories. All settings should be placed within their corresponding top-level category object in your `settings.json` file.
76
+ Settings are organized into categories. Most settings should be placed within their corresponding top-level category object in your `settings.json` file. A few compatibility settings, such as `proxy`, are top-level keys.
77
+
78
+ #### top-level
79
+
80
+ | Setting | Type | Description | Default |
81
+ | ------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- |
82
+ | `proxy` | string | Proxy URL for CLI HTTP requests. Precedence is `--proxy` > `proxy` in `settings.json` > `HTTPS_PROXY` / `https_proxy` / `HTTP_PROXY` / `http_proxy` environment variables. | `undefined` |
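+
+ A minimal sketch of that precedence chain (illustrative only, with hypothetical variable names — not the actual resolution code):
+
+ ```ts
+ // --proxy flag > settings.json "proxy" > proxy environment variables.
+ declare const cliFlagProxy: string | undefined;   // parsed --proxy value (hypothetical name)
+ declare const settingsProxy: string | undefined;  // top-level "proxy" from settings.json
+
+ const proxy =
+   cliFlagProxy ??
+   settingsProxy ??
+   process.env.HTTPS_PROXY ?? process.env.https_proxy ??
+   process.env.HTTP_PROXY ?? process.env.http_proxy;
+ ```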
77
83
 
78
84
  #### general
79
85
 
@@ -134,17 +140,17 @@ Settings are organized into categories. All settings should be placed within the
134
140
 
135
141
  #### model
136
142
 
137
- | Setting | Type | Description | Default |
138
- | -------------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------- |
139
- | `model.name` | string | The Qwen model to use for conversations. | `undefined` |
140
- | `model.maxSessionTurns` | number | Maximum number of user/model/tool turns to keep in a session. -1 means unlimited. | `-1` |
141
- | `model.generationConfig` | object | Advanced overrides passed to the underlying content generator. Supports request controls such as `timeout`, `maxRetries`, `enableCacheControl`, `splitToolMedia` (set `true` for strict OpenAI-compatible servers like LM Studio that reject non-text content on `role: "tool"` messages — splits media into a follow-up user message), `contextWindowSize` (override model's context window size), `modalities` (override auto-detected input modalities), `customHeaders` (custom HTTP headers for API requests), and `extra_body` (additional body parameters for OpenAI-compatible API requests only), along with fine-tuning knobs under `samplingParams` (for example `temperature`, `top_p`, `max_tokens`). Leave unset to rely on provider defaults. | `undefined` |
142
- | `model.chatCompression.contextPercentageThreshold` | number | Sets the threshold for chat history compression as a percentage of the model's total token limit. This is a value between 0 and 1 that applies to both automatic compression and the manual `/compress` command. For example, a value of `0.6` will trigger compression when the chat history exceeds 60% of the token limit. Use `0` to disable compression entirely. | `0.7` |
143
- | `model.skipNextSpeakerCheck` | boolean | Skip the next speaker check. | `false` |
144
- | `model.skipLoopDetection` | boolean | Disables loop detection checks. Loop detection prevents infinite loops in AI responses but can generate false positives that interrupt legitimate workflows. Enable this option if you experience frequent false positive loop detection interruptions. | `false` |
145
- | `model.skipStartupContext` | boolean | Skips sending the startup workspace context (environment summary and acknowledgement) at the beginning of each session. Enable this if you prefer to provide context manually or want to save tokens on startup. | `false` |
146
- | `model.enableOpenAILogging` | boolean | Enables logging of OpenAI API calls for debugging and analysis. When enabled, API requests and responses are logged to JSON files. | `false` |
147
- | `model.openAILoggingDir` | string | Custom directory path for OpenAI API logs. If not specified, defaults to `logs/openai` in the current working directory. Supports absolute paths, relative paths (resolved from current working directory), and `~` expansion (home directory). | `undefined` |
143
+ | Setting | Type | Description | Default |
144
+ | -------------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------- |
145
+ | `model.name` | string | The Qwen model to use for conversations. | `undefined` |
146
+ | `model.maxSessionTurns` | number | Maximum number of user/model/tool turns to keep in a session. -1 means unlimited. | `-1` |
147
+ | `model.generationConfig` | object | Advanced overrides passed to the underlying content generator. Supports request controls such as `timeout`, `maxRetries`, `enableCacheControl`, `splitToolMedia` (set `true` for strict OpenAI-compatible servers like LM Studio that reject non-text content on `role: "tool"` messages — splits media into a follow-up user message), `contextWindowSize` (override model's context window size), `modalities` (override auto-detected input modalities), `customHeaders` (custom HTTP headers for API requests), `extra_body` (additional body parameters for OpenAI-compatible API requests only), and `reasoning` (`{ effort: 'low' \| 'medium' \| 'high' \| 'max', budget_tokens?: number }` to control thinking intensity, or `false` to disable; `'max'` is a DeepSeek extension — see [Reasoning / thinking configuration](./model-providers.md#reasoning--thinking-configuration) for per-provider behavior. **Note:** when `samplingParams` is set on an OpenAI-compatible provider, the pipeline ships those keys verbatim and the separate top-level `reasoning` field is dropped — put `reasoning_effort` inside `samplingParams` (or `extra_body`) instead in that case), along with fine-tuning knobs under `samplingParams` (for example `temperature`, `top_p`, `max_tokens`). Leave unset to rely on provider defaults. | `undefined` |
148
+ | `model.chatCompression.contextPercentageThreshold` | number | Sets the threshold for chat history compression as a percentage of the model's total token limit. This is a value between 0 and 1 that applies to both automatic compression and the manual `/compress` command. For example, a value of `0.6` will trigger compression when the chat history exceeds 60% of the token limit. Use `0` to disable compression entirely. | `0.7` |
149
+ | `model.skipNextSpeakerCheck` | boolean | Skip the next speaker check. | `false` |
150
+ | `model.skipLoopDetection` | boolean | Disables loop detection checks. Loop detection prevents infinite loops in AI responses but can generate false positives that interrupt legitimate workflows. Enable this option if you experience frequent false positive loop detection interruptions. | `false` |
151
+ | `model.skipStartupContext` | boolean | Skips sending the startup workspace context (environment summary and acknowledgement) at the beginning of each session. Enable this if you prefer to provide context manually or want to save tokens on startup. | `false` |
152
+ | `model.enableOpenAILogging` | boolean | Enables logging of OpenAI API calls for debugging and analysis. When enabled, API requests and responses are logged to JSON files. | `false` |
153
+ | `model.openAILoggingDir` | string | Custom directory path for OpenAI API logs. If not specified, defaults to `logs/openai` in the current working directory. Supports absolute paths, relative paths (resolved from current working directory), and `~` expansion (home directory). | `undefined` |
148
154
 
149
155
  **Example model.generationConfig:**
150
156
 
@@ -470,6 +476,7 @@ Here is an example of a `settings.json` file with the nested structure, new as o
470
476
 
471
477
  ```
472
478
  {
479
+ "proxy": "http://localhost:7890",
473
480
  "general": {
474
481
  "vimMode": true,
475
482
  "preferredEditor": "code"
@@ -29,14 +29,16 @@ The `/review` command runs a multi-stage pipeline:
29
29
  Step 1: Determine scope (local diff / PR worktree / file)
30
30
  Step 2: Load project review rules
31
31
  Step 3: Run deterministic analysis (linter, typecheck) [zero LLM cost]
32
- Step 4: 5 parallel review agents [5 LLM calls]
33
- |-- Agent 1: Correctness & Security
34
- |-- Agent 2: Code Quality
35
- |-- Agent 3: Performance & Efficiency
36
- |-- Agent 4: Undirected Audit
37
- '-- Agent 5: Build & Test (runs shell commands)
32
+ Step 4: 9 parallel review agents [9 LLM calls]
33
+ |-- Agent 1: Correctness
34
+ |-- Agent 2: Security
35
+ |-- Agent 3: Code Quality
36
+ |-- Agent 4: Performance & Efficiency
37
+ |-- Agent 5: Test Coverage
38
+ |-- Agent 6: Undirected Audit (3 personas: 6a/6b/6c)
39
+ '-- Agent 7: Build & Test (runs shell commands)
38
40
  Step 5: Deduplicate --> Batch verify --> Aggregate [1 LLM call]
39
- Step 6: Reverse audit (find coverage gaps) [1 LLM call]
41
+ Step 6: Iterative reverse audit (1-3 rounds, gap finding) [1-3 LLM calls]
40
42
  Step 7: Present findings + verdict
41
43
  Step 8: Autofix (user-confirmed, optional)
42
44
  Step 9: Post PR inline comments (if requested)
@@ -46,15 +48,17 @@ Step 11: Clean up (remove worktree + temp files)
46
48
 
47
49
  ### Review Agents
48
50
 
49
- | Agent | Focus |
50
- | --------------------------------- | ------------------------------------------------------------------ |
51
- | Agent 1: Correctness & Security | Logic errors, null handling, race conditions, injection, XSS, SSRF |
52
- | Agent 2: Code Quality | Style consistency, naming, duplication, dead code |
53
- | Agent 3: Performance & Efficiency | N+1 queries, memory leaks, unnecessary re-renders, bundle size |
54
- | Agent 4: Undirected Audit | Business logic, boundary interactions, hidden coupling |
55
- | Agent 5: Build & Test | Runs build and test commands, reports failures |
51
+ | Agent | Focus |
52
+ | --------------------------------- | ------------------------------------------------------------------------------------------- |
53
+ | Agent 1: Correctness | Logic errors, edge cases, null handling, race conditions, type safety |
54
+ | Agent 2: Security | Injection, XSS, SSRF, auth bypass, sensitive data exposure |
55
+ | Agent 3: Code Quality | Style consistency, naming, duplication, dead code |
56
+ | Agent 4: Performance & Efficiency | N+1 queries, memory leaks, unnecessary re-renders, bundle size |
57
+ | Agent 5: Test Coverage | Untested code paths in the diff, missing branch coverage, weak assertions |
58
+ | Agent 6: Undirected Audit | 3 parallel personas (attacker / 3am-oncall / maintainer) — catches cross-dimensional issues |
59
+ | Agent 7: Build & Test | Runs build and test commands, reports failures |
56
60
 
57
- All agents run in parallel. Findings from Agents 1-4 are verified in a **single batch verification pass** (one agent reviews all findings at once, keeping LLM calls fixed). After verification, a **reverse audit agent** re-reads the entire diff with knowledge of all confirmed findings to catch issues that every other agent missed. Reverse audit findings skip the verification step (the agent already has full context) and are included directly as high-confidence results.
61
+ All agents run in parallel (Agent 6 launches 3 persona variants concurrently, totaling 9 parallel tasks for same-repo reviews). Findings from Agents 1-6 are verified in a **single batch verification pass** (one agent reviews all findings at once, keeping verification cost fixed regardless of finding count). After verification, **iterative reverse audit** runs 1-3 rounds of gap-finding — each round receives the cumulative finding list from prior rounds, so successive rounds focus on whatever's left undiscovered. The loop stops as soon as a round returns "No issues found", or after 3 rounds (hard cap). Reverse audit findings skip verification (the agent already has full context) and are included as high-confidence results.
58
62
 
59
63
  ## Deterministic Analysis
60
64
 
@@ -127,8 +131,8 @@ This runs in **lightweight mode** — no worktree, no linter, no build/test, no
127
131
 
128
132
  | Capability | Same-repo | Cross-repo |
129
133
  | ------------------------------------------------ | --------- | ----------------------------- |
130
- | LLM review (Agents 1-4 + verify + reverse audit) | ✅ | ✅ |
131
- | Agent 5: Build & test | ✅ | ❌ (no local codebase) |
134
+ | LLM review (Agents 1-6 + verify + iterative reverse audit) | ✅ | ✅ |
135
+ | Agent 7: Build & test | ✅ | ❌ (no local codebase) |
132
136
  | Deterministic analysis (linter/typecheck) | ✅ | ❌ |
133
137
  | Cross-file impact analysis | ✅ | ❌ |
134
138
  | Autofix | ✅ | ❌ |
@@ -157,6 +161,12 @@ Or, after running `/review 123`, type `post comments` to publish findings withou
157
161
  - Nice to have findings (including linter warnings)
158
162
  - Low-confidence findings
159
163
 
164
+ **Self-authored PRs:** GitHub does not allow you to submit `APPROVE` or `REQUEST_CHANGES` reviews on your own pull request — both fail with HTTP 422. When `/review` detects that the PR author matches the current authenticated user, it automatically downgrades the API event to `COMMENT` regardless of verdict, so the submission still succeeds. The terminal still shows the honest verdict ("Approve" / "Request changes" / "Comment") — only the GitHub-side review event is neutralized. The actual findings still appear as inline comments on specific lines, so substantive feedback is unchanged.
165
+
166
+ **Re-reviewing a PR with prior Qwen Code comments:** when `/review` runs on a PR that already has previous Qwen Code review comments, it classifies them before posting new ones. Only **same-line overlap** (an existing comment on the same `(path, line)` as a new finding) prompts you to confirm — that's the case where you'd see a visual duplicate on the same code line. Comments from older commits, replied-to comments (treated as resolved), and comments that simply don't overlap with any new finding are silently skipped, with a terminal log line so you know what was filtered.
167
+
168
+ **CI / build status check before APPROVE:** if the verdict is "Approve", `/review` queries the PR's check-runs and commit statuses before submitting. If any check has failed (or all checks are still pending), the API event is automatically downgraded from `APPROVE` to `COMMENT`, with the review body explaining why. Rationale: the LLM review reads code statically and cannot see runtime test failures; approving while CI is red would be misleading. The inline findings are still posted unchanged. If you want to approve anyway (e.g., a known-flaky CI failure), submit the GitHub approval manually after verifying.
169
+
160
170
  ## Follow-up Actions
161
171
 
162
172
  After the review, context-aware tips appear as ghost text. Press Tab to accept:
@@ -179,7 +189,7 @@ You can customize review criteria per project. `/review` reads rules from these
179
189
  3. `AGENTS.md` — `## Code Review` section
180
190
  4. `QWEN.md` — `## Code Review` section
181
191
 
182
- Rules are injected into the LLM review agents (1-4) as additional criteria. For PR reviews, rules are read from the **base branch** to prevent a malicious PR from injecting bypass rules.
192
+ Rules are injected into the LLM review agents (1-6) as additional criteria. For PR reviews, rules are read from the **base branch** to prevent a malicious PR from injecting bypass rules.
183
193
 
184
194
  Example `.qwen/review-rules.md`:
185
195
 
@@ -246,15 +256,17 @@ For large diffs (>10 modified symbols), analysis prioritizes functions with sign
246
256
 
247
257
  ## Token Efficiency
248
258
 
249
- The review pipeline uses a fixed number of LLM calls regardless of how many findings are produced:
259
+ The review pipeline uses a bounded number of LLM calls regardless of how many findings are produced:
260
+
261
+ | Stage | LLM calls | Notes |
262
+ | -------------------------------- | ----------------- | ---------------------------------------------------- |
263
+ | Deterministic analysis (Step 3) | 0 | Shell commands only |
264
+ | Review agents (Step 4) | 9 (or 8) | Run in parallel; Agent 7 skipped in cross-repo mode |
265
+ | Batch verification (Step 5) | 1 | Single agent verifies all findings at once |
266
+ | Iterative reverse audit (Step 6) | 1-3 | Loops until "No issues found" or 3-round cap |
267
+ | **Total** | **11-13 (10-12)** | Same-repo: 11-13; cross-repo: 10-12 (no Agent 7) |
250
268
 
251
- | Stage | LLM calls | Notes |
252
- | ------------------------------- | ---------- | --------------------------------------------------- |
253
- | Deterministic analysis (Step 3) | 0 | Shell commands only |
254
- | Review agents (Step 4) | 5 (or 4) | Run in parallel; Agent 5 skipped in cross-repo mode |
255
- | Batch verification (Step 5) | 1 | Single agent verifies all findings at once |
256
- | Reverse audit (Step 6) | 1 | Finds coverage gaps; findings skip verification |
257
- | **Total** | **7 or 6** | Same-repo: 7; cross-repo: 6 (no Agent 5) |
269
+ Most PRs converge to the lower end of the range (1 reverse audit round); the cap prevents runaway cost on pathological cases.
258
270
 
259
271
  ## What's NOT Flagged
260
272
 
@@ -89,14 +89,35 @@ Show concrete examples of using this Skill.
89
89
 
90
90
  Qwen Code currently validates that:
91
91
 
92
- - `name` is a non-empty string
92
+ - `name` is a non-empty string matching `/^[\p{L}\p{N}_:.-]+$/u` — Unicode letters and digits (CJK / Cyrillic / accented Latin all OK), plus `_`, `:`, `.`, `-`. Whitespace, slashes, brackets and other structurally unsafe characters are rejected at parse time.
93
93
  - `description` is a non-empty string
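+
+ For example, the `name` pattern above accepts and rejects the following (a quick illustration, not part of the validator):
+
+ ```ts
+ const SKILL_NAME = /^[\p{L}\p{N}_:.-]+$/u;
+
+ SKILL_NAME.test('tsx-helper'); // true  — lowercase ASCII with a hyphen
+ SKILL_NAME.test('组件助手');    // true  — CJK characters are Unicode letters
+ SKILL_NAME.test('my skill');   // false — whitespace is rejected
+ SKILL_NAME.test('docs/build'); // false — slashes are rejected
+ ```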
94
94
 
95
- Recommended conventions (not strictly enforced yet):
95
+ Recommended conventions:
96
96
 
97
- - Use lowercase letters, numbers, and hyphens in `name`
97
+ - Prefer lowercase ASCII with hyphens for shareable names (e.g. `tsx-helper`)
98
98
  - Make `description` specific: include both **what** the Skill does and **when** to use it (key words users will naturally mention)
99
99
 
100
+ ### Optional: gate a Skill on file paths (`paths:`)
101
+
102
+ For Skills that only matter to specific parts of a codebase, add a `paths:` list of glob patterns. The Skill stays out of the model's available-skills listing until a tool call touches a matching file:
103
+
104
+ ```yaml
105
+ ---
106
+ name: tsx-helper
107
+ description: React TSX component helper
108
+ paths:
109
+ - 'src/**/*.tsx'
110
+ - 'packages/*/src/**/*.tsx'
111
+ ---
112
+ ```
113
+
114
+ Notes:
115
+
116
+ - Globs are matched relative to the project root with [picomatch](https://github.com/micromatch/picomatch); files outside the project root never trigger activation.
117
+ - A path-gated Skill **stays activated for the rest of the session** once a matching file is touched. A new session, or a `refreshCache` triggered by editing any Skill file, resets activations.
118
+ - `paths:` only gates **model** discovery, and only at the SkillTool listing level. You can always invoke a path-gated Skill yourself via `/<skill-name>` or the `/skills` picker — that user path runs the Skill body regardless of activation state. The model side, however, stays gated until a matching file is touched: a slash invocation does **not** unlock model-side activation, so if you want the model to chain off your invocation (call `Skill { skill: ... }` itself), also access a file matching the skill's `paths:` first.
119
+ - Combining `paths:` with `disable-model-invocation: true` is allowed but the gate has no effect — the Skill is hidden from the model regardless, so path activation never advertises it.
120
+
100
121
  ## Add supporting files
101
122
 
102
123
  Create additional files alongside `SKILL.md`:
@@ -146,6 +167,14 @@ To view available Skills, ask Qwen Code directly:
146
167
  What Skills are available?
147
168
  ```
148
169
 
170
+ > **Heads up — model vs. user view.** Asking the model only surfaces Skills the model can currently see. If a Skill uses `paths:` (see "Optional: gate a Skill on file paths" above), it stays out of that listing until a matching file has been touched. The full set is always visible to you via the `/skills` slash command and on disk.
171
+
172
+ Or browse the full list with the slash command (always shows every Skill, including path-gated ones that have not activated yet):
173
+
174
+ ```text
175
+ /skills
176
+ ```
177
+
149
178
  Or inspect the filesystem:
150
179
 
151
180
  ```bash
@@ -2,20 +2,35 @@
2
2
 
3
3
  > Architecture decisions, trade-offs, and rejected alternatives for the `/review` skill.
4
4
 
5
- ## Why 5 agents + 1 verify + 1 reverse, not 1 agent?
5
+ ## Why 9 agents + 1 verify + iterative reverse, not 1 agent?
6
6
 
7
7
  **Considered:**
8
8
 
9
9
  - **1 agent (Copilot approach):** Single agent with tool-calling, reads and reviews in one pass. Cheapest (1 LLM call). But dimensional coverage depends entirely on one prompt's attention — easy to miss performance issues while focused on security.
10
- - **5 parallel agents (chosen):** Each agent focuses on one dimension. Higher coverage through forced diversity of perspective. Cost: 5 LLM calls, but they run in parallel so wall-clock time is similar to 1 agent.
10
+ - **5 parallel agents (original design):** Each agent focuses on one dimension. Higher coverage through forced diversity of perspective. But the combined Correctness+Security agent and the single undirected pass imposed a recall ceiling that left findings on the table — findings the user only discovered in subsequent /review rounds.
11
+ - **9 parallel agents (current):** 6 review dimensions (Correctness, Security, Code Quality, Performance, Test Coverage, Undirected) + Build & Test. Undirected runs as 3 personas in parallel.
11
12
 
12
- **Decision:** 5 agents. The marginal cost (5x vs 1x) is acceptable because:
13
+ **Decision:** 9 agents. The marginal cost (9x vs 1x) is acceptable because:
13
14
 
14
- 1. Parallel execution means time cost is ~1x (all 5 agents must launch in one response)
15
+ 1. Parallel execution means time cost is ~1x (all 9 agents launch in one response)
15
16
  2. Dimensional focus produces higher recall (fewer missed issues)
16
- 3. Agent 4 (Undirected Audit) catches cross-dimensional issues
17
+ 3. Three undirected personas (attacker / 3am-oncall / maintainer) catch cross-dimensional issues that a single undirected agent's prompt-induced bias would miss
17
18
  4. The "Silence is better than noise" principle + verification controls precision
18
19
 
20
+ ### Why split Correctness from Security
21
+
22
+ A single Correctness+Security agent has split attention — empirically one dimension dominates the output and the other is shallow. Different mindsets too: correctness asks "does this do what it intends," security asks "what unintended thing can a hostile actor make this do." Splitting forces both to get full attention.
23
+
24
+ ### Why a dedicated Test Coverage agent
25
+
26
+ Test gaps are a systematic blind spot. Review agents focused on bugs in the new code itself rarely look at whether the change came with adequate tests. A dedicated agent that asks "what scenarios in this diff are untested?" catches gaps that no other dimension covers.
27
+
28
+ ### Why three undirected personas instead of one or many
29
+
30
+ A single undirected agent has prompt-induced bias and tends to find the same kinds of issues across runs. Three personas — attacker / 3am-oncall / maintainer — force completely different mental traversals, and the union of their findings is meaningfully larger than 1.5× what a single agent finds.
31
+
32
+ Empirically, ensemble diversity drops sharply past 3-5 sampled paths. Three is the sweet spot: enough to break single-prompt bias, few enough that the marginal cost stays bounded.
33
+
19
34
  ## Why batch verification instead of N independent agents?
20
35
 
21
36
  **Considered:**
@@ -25,16 +40,46 @@
25
40
 
26
41
  **Decision:** Batch. The quality difference is minimal — a single agent verifying 15 findings has MORE context than 15 independent agents (sees cross-finding relationships). Cost drops from O(N) to O(1).
27
42
 
28
- ## Why reverse audit is a separate step, not merged with verification
43
+ ## Why reverse audit is a separate step, and why iterative
29
44
 
30
- **Considered:**
45
+ ### Why separate from verification
31
46
 
32
47
  - **Merge with verification:** Verification agent also looks for gaps. Saves 1 LLM call.
33
48
  - **Separate step (chosen):** Reverse audit is a full diff re-read, not a finding check. Different cognitive task.
34
49
 
35
- **Decision:** Separate. Verification is targeted (check specific claims at specific locations). Reverse audit is open-ended (scan entire diff for missed issues). Combining overloads one agent with two fundamentally different tasks, degrading both.
50
+ Verification is targeted (check specific claims at specific locations). Reverse audit is open-ended (scan entire diff for missed issues). Combining overloads one agent with two fundamentally different tasks, degrading both.
51
+
52
+ ### Why iterative (multi-round)
53
+
54
+ A single reverse audit pass leaves whatever the reverse audit agent itself missed. Each new round receives the cumulative finding list from prior rounds, so it focuses on what's left undiscovered. Empirically, most PRs converge in 1-2 rounds; the 3-round hard cap prevents runaway cost on pathological cases.
55
+
56
+ ### Why cap at 3 rounds, not unlimited
57
+
58
+ Diminishing returns. Past round 3, the marginal yield is low and a stuck-loop hazard rises (the model may fabricate issues to satisfy the "find more" framing). The "No issues found" termination already exits early on most PRs — the cap is a safety net, not the common path.
36
59
 
37
- **Optimization:** Reverse audit findings skip verification. The reverse audit agent already has full context (all confirmed findings + entire diff), so its output is inherently high-confidence. This keeps total calls at 7, not 8.
60
+ **Optimization preserved:** Reverse audit findings skip verification (across all rounds). The agent has full context, so output is inherently high-confidence.
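+
+ A minimal sketch of the loop, with assumed names and signatures (illustrative, not the actual implementation):
+
+ ```ts
+ type Finding = { path: string; line: number; summary: string };
+
+ // Each round sees every finding confirmed so far, so it hunts only what is still missing.
+ async function iterativeReverseAudit(
+   diff: string,
+   confirmed: Finding[],
+   runRound: (diff: string, known: Finding[]) => Promise<Finding[]>, // one reverse-audit agent call
+ ): Promise<Finding[]> {
+   const extra: Finding[] = [];
+   for (let round = 1; round <= 3; round++) {                 // hard cap: 3 rounds
+     const found = await runRound(diff, [...confirmed, ...extra]);
+     if (found.length === 0) break;                           // "No issues found" → stop early
+     extra.push(...found);                                    // skip verification: full context already
+   }
+   return extra;
+ }
+ ```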
61
+
62
+ ## Why low-confidence over rejection on uncertain findings
63
+
64
+ **Original behavior:** When verification was uncertain, it would reject. Bias toward precision.
65
+
66
+ **Problem:** Uncertain findings often turn out to be real after human inspection. Rejection silently swallows valid concerns. Users discover them in the next iteration of /review or after merging — exactly the "iterate many rounds" pain this redesign targets.
67
+
68
+ **Current behavior:** Uncertain → "confirmed (low confidence)". Low-confidence findings:
69
+
70
+ - Appear in terminal output under "Needs Human Review"
71
+ - Are filtered out of PR inline comments (preserves "Silence is better than noise" for PR interactions)
72
+ - Do not affect the verdict (Approve/Request changes/Comment is computed from high-confidence findings only)
73
+
74
+ **Trade-off:** Terminal output gets noisier. PR comments stay clean. The user sees concerns without the cost of false-positive PR noise.
75
+
76
+ **Reserved for outright rejection:**
77
+
78
+ - Finding describes behavior the code does not actually have (factually wrong about the code)
79
+ - Finding matches an Exclusion Criterion (pre-existing issue, formatting nitpick, etc.)
80
+ - Vague suspicion with no concrete code reference
81
+
82
+ This boundary keeps the low-confidence bucket meaningful — it's "likely real but needs human judgment," not "I have no idea."
38
83
 
39
84
  ## Why worktree instead of stash + checkout
40
85
 
@@ -59,6 +104,76 @@ Applied throughout:
59
104
  - Uncertain issues → rejected, not reported
60
105
  - Pattern aggregation → same issue across N files reported once
61
106
 
107
+ ## Why classify existing Qwen Code comments instead of always prompting
108
+
109
+ **Original behavior:** any existing Qwen Code review comment on the PR → inform the user and require confirmation before posting new comments.
110
+
111
+ **Problem:** in real /review usage, most existing Qwen Code comments fall into one of three "no-real-conflict" cases:
112
+
113
+ 1. **Stale by commit**: the comment was posted against an older PR HEAD; the underlying code has changed.
114
+ 2. **Resolved by reply**: someone has replied in the thread (the original author "fixed in abc123" or a reviewer "ok, approved"). The conversation is closed.
115
+ 3. **No anchor overlap**: the old comment is on a different `(path, line)` from any new finding. They simply coexist.
116
+
117
+ Forcing the user to confirm-or-decline every time the PR has any Qwen Code history creates prompt fatigue without protecting against the real risk — which is **commenting twice on the same line**, producing visual duplicates that look like a bug to PR readers.
118
+
119
+ **New behavior:** classify each existing Qwen Code comment by checking in priority order — **Stale by commit** > **Resolved by reply** > **Overlap** (same `path + line` as a new finding) > **No conflict**. The first match wins. Only the Overlap class blocks; the other three log to the terminal and continue.
120
+
121
+ **Priority matters because** a stale or resolved comment that happens to share a `(path, line)` with a new finding is not a real conflict — the underlying code may have changed in the stale case, and the conversation is already closed in the resolved case. Without priority, the line-based check would fire false-positive prompts on those.
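+
+ A minimal sketch of that first-match-wins classification (field names are assumptions for illustration):
+
+ ```ts
+ type Bucket = 'stale-by-commit' | 'resolved-by-reply' | 'overlap' | 'no-conflict';
+ type ExistingComment = { path: string; line: number; commitSha: string; hasReplies: boolean };
+ type NewFinding = { path: string; line: number };
+
+ // First match wins: stale > resolved > overlap > no-conflict. Only 'overlap' blocks.
+ function classifyExistingComment(c: ExistingComment, headSha: string, findings: NewFinding[]): Bucket {
+   if (c.commitSha !== headSha) return 'stale-by-commit';
+   if (c.hasReplies) return 'resolved-by-reply';
+   if (findings.some((f) => f.path === c.path && f.line === c.line)) return 'overlap';
+   return 'no-conflict';
+ }
+ ```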
122
+
123
+ **Trade-off:**
124
+ - ✅ Common case (re-running /review on a PR after a few new commits) no longer prompts unnecessarily.
125
+ - ✅ The terminal log keeps the user informed about what was skipped, so transparency is preserved.
126
+ - ❌ Conceptual overlap that doesn't share a line is missed — e.g. a prior comment on line 559 about cache lifecycle and a new comment on line 1352 about cache lifecycle would be classified `No conflict`. Line-based heuristics cannot detect "same root cause, different anchor." If the user wants semantic-overlap detection, they must read the terminal log and the PR comments themselves.
127
+
128
+ Line-based classification was chosen because it's deterministic, cheap, and catches the precise UX failure (visual duplicate at the same line). Semantic overlap detection would require an extra LLM call for what is, in practice, a rare edge case.
129
+
130
+ ## Why downgrade APPROVE when CI is non-green
131
+
132
+ **Original behavior:** if Step 7 resolved verdict to `APPROVE`, the API event was submitted as `APPROVE` without any check on CI status.
133
+
134
+ **Problem:** the LLM review pipeline reads the diff and surrounding code statically. It does not run tests, does not exercise integration boundaries, and does not see runtime failures. CI does. A PR with red CI but no static red flags is **the worst case** for an LLM `APPROVE` — the human reader sees an Approve badge from a tool that didn't actually verify the change runs.
135
+
136
+ **Current behavior:** before submitting `APPROVE`, query `check-runs` and legacy commit `statuses` for the PR HEAD. Classify:
137
+
138
+ - All success → `APPROVE` continues.
139
+ - Any failure → downgrade `APPROVE` to `COMMENT`, body explains.
140
+ - All pending → downgrade to `COMMENT` (don't approve before CI decides), body explains.
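+
+ The rule above, sketched as a decision function (illustrative only; real states come from the GitHub check-runs / statuses APIs):
+
+ ```ts
+ type CheckState = 'success' | 'failure' | 'pending';
+ type ReviewEvent = 'APPROVE' | 'REQUEST_CHANGES' | 'COMMENT';
+
+ // Only an APPROVE verdict is affected; inline findings are posted either way.
+ function resolveReviewEvent(verdict: ReviewEvent, checks: CheckState[]): ReviewEvent {
+   if (verdict !== 'APPROVE') return verdict;
+   if (checks.some((c) => c === 'failure')) return 'COMMENT';                       // any failure → downgrade
+   if (checks.length > 0 && checks.every((c) => c === 'pending')) return 'COMMENT'; // CI undecided → wait
+   return 'APPROVE';                                                                // all green → approve
+ }
+ ```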
141
+
142
+ **Why downgrade rather than block:** the reviewer LLM has done substantive work; throwing the review away because CI is red wastes that. Downgrading to `COMMENT` keeps all inline findings, preserves the static review value, and lets GitHub's check status carry the "do not merge" signal naturally.
143
+
144
+ **Why this stacks with self-PR downgrade:** a self-authored PR with red CI hits **both** downgrade rules. The event is `COMMENT` either way, so stacking is operationally a no-op — but the body should mention both reasons so a future maintainer reading the review knows why an LLM that found no Critical issues did not approve.
145
+
146
+ **Trade-off:**
147
+ - ✅ No more "LLM approved while CI is red" embarrassments.
148
+ - ✅ Reviewer's substantive work (inline comments) is preserved.
149
+ - ❌ Adds two extra API calls (`check-runs` + `statuses`) per APPROVE-bound submit; only relevant for the `APPROVE` path so the cost is negligible.
150
+ - ❌ A genuinely flaky CI failure can downgrade what should have been an Approve. Mitigation: the body text directs the user to verify; they can always submit `APPROVE` manually after triaging.
151
+
152
+ ## Why the deterministic checks live as `qwen review` subcommands
153
+
154
+ **Original behavior:** Step 9's three pre-submission checks (self-PR detection, CI status, existing-comment classification) and Step 11's cleanup were inlined in SKILL.md as `gh api` / `git` shell commands. The LLM ran each command itself, parsed the output, and applied the classification logic.
155
+
156
+ **Problems with inlining:**
157
+
158
+ 1. **Token cost**: each command, jq filter, classification rule, and output schema is part of the prompt — every `/review` invocation pays this cost.
159
+ 2. **Drift risk**: the classification logic exists twice (in the prompt's English description, and in whatever the LLM internally synthesizes). When rules change (new check_run conclusion type, new comment bucket), both have to update or they drift.
160
+ 3. **Cross-platform fragility**: `/tmp/qwen-review-*` worked on macOS shell but Node's `os.tmpdir()` returned `/var/folders/...`. The mismatch only surfaced when the cleanup logic was tested.
161
+ 4. **Testability**: prompt text isn't unit-testable. Logic that classifies CI states or comment buckets is the kind of thing that benefits from real assertions.
162
+
163
+ **Current behavior:** the deterministic logic lives in `packages/cli/src/commands/review/` as TypeScript subcommands of the `qwen` CLI:
164
+
165
+ - `qwen review presubmit <pr> <sha> <owner/repo> <out>` — emits a single JSON report with `isSelfPr`, `ciStatus`, `existingComments` (4 buckets), `downgradeApprove`, `downgradeRequestChanges`, `downgradeReasons`, `blockOnExistingComments`. SKILL.md only describes the schema and how to apply the report.
166
+ - `qwen review cleanup <target>` — removes the worktree, branch ref, and per-target temp files. Idempotent.
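+
+ A sketch of the report shape implied by that field list (property names beyond those listed, and their types, are assumptions — the schema documented in SKILL.md is authoritative):
+
+ ```ts
+ // Assumed shape of the JSON written by `qwen review presubmit <pr> <sha> <owner/repo> <out>`.
+ interface PresubmitReport {
+   isSelfPr: boolean;
+   ciStatus: 'success' | 'failure' | 'pending';
+   existingComments: {
+     staleByCommit: unknown[];   // bucket names taken from the classification section; shapes assumed
+     resolvedByReply: unknown[];
+     overlap: unknown[];
+     noConflict: unknown[];
+   };
+   downgradeApprove: boolean;
+   downgradeRequestChanges: boolean;
+   downgradeReasons: string[];
+   blockOnExistingComments: boolean;
+ }
+ ```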
167
+
168
+ **Why subcommands rather than `.mjs` scripts in the skill bundle:**
169
+
170
+ - `.mjs` files were tried first but `copy_files.js` only bundles `.md`/`.json`/`.sb`. Adding `.mjs` to the bundler is one option, but it leaves the script standing alone with no integration into `qwen`'s CLI surface.
171
+ - yargs subcommands compile via the same `tsc` step as the rest of `packages/cli`, so the build pipeline doesn't change.
172
+ - The LLM doesn't need any path resolution — it calls `qwen review presubmit ...` exactly like it would any other shell command. No `{SKILL_DIR}` template, no `npx` indirection.
173
+ - Cross-platform path handling (`path.join`, `os.tmpdir` vs project-local `.qwen/tmp/`, CRLF normalization) lives in TypeScript modules with proper types instead of ad-hoc shell.
174
+
175
+ **Trade-off:** when the deterministic logic changes (e.g., a new GitHub `conclusion` value), the CLI code must be rebuilt and re-shipped along with the skill. SKILL.md and the subcommand are versioned together in this monorepo, so in practice that coupling is a benefit rather than a cost — they cannot drift apart within any single release.
176
+
62
177
  ## Why base-branch rule loading (security)
63
178
 
64
179
  A malicious PR could add `.qwen/review-rules.md` with "never report security issues." If rules are read from the PR branch, the review is compromised.
@@ -76,17 +191,19 @@ A malicious PR could add `.qwen/review-rules.md` with "never report security iss
76
191
 
77
192
  **Exception:** Autofix uses a blocking y/n because it modifies code — higher stakes require explicit consent.
78
193
 
79
- ## Why fixed 7 LLM calls
194
+ ## LLM call budget (variable, ~11-13)
195
+
196
+ | Stage | Calls | Why |
197
+ | ----------------------- | ----------------- | ------------------------------------------------------------------- |
198
+ | Deterministic analysis | 0 | Shell commands — ground truth for free |
199
+ | Review agents | 9 (8) | 6 dimensions + 3 undirected personas; Agent 7 skipped in cross-repo |
200
+ | Batch verification | 1 | O(1) not O(N) — batch is as good as individual |
201
+ | Iterative reverse audit | 1-3 | Loop until "No issues found" or 3-round hard cap |
202
+ | **Total** | **11-13 (10-12)** | Same-repo: 11-13; cross-repo lightweight: 10-12 |
80
203
 
81
- | Stage | Calls | Why |
82
- | ---------------------- | --------- | --------------------------------------------------- |
83
- | Deterministic analysis | 0 | Shell commands — ground truth for free |
84
- | Review agents | 5 (4) | Dimensional coverage; Agent 5 skipped in cross-repo |
85
- | Batch verification | 1 | O(1) not O(N) — batch is as good as individual |
86
- | Reverse audit | 1 | Full context, skip verification |
87
- | **Total** | **7 (6)** | Same-repo: 7; cross-repo lightweight: 6 |
204
+ The exact count depends on how many iterative reverse audit rounds run. Most PRs converge after 1-2 rounds; the cap prevents runaway cost.
88
205
 
89
- Competitors: Copilot uses 1 call, Gemini uses 2, Claude /ultrareview uses 5-20 (cloud). Our 7 is a balance of coverage vs cost.
206
+ Competitors: Copilot uses 1 call, Gemini uses 2, Claude /ultrareview uses 5-20 (cloud). Our 11-13 call budget biases toward higher recall — the assumption is that "find more issues per round" is more valuable than minimizing per-run cost, because every missed issue forces the user into another `/review` iteration.
90
207
 
91
208
  ## Why cross-repo uses lightweight mode
92
209
 
@@ -118,26 +235,27 @@ Key implementation detail: Step 9 must use the owner/repo extracted from the URL
118
235
  | `gh pr checkout --detach` for worktree | It modifies the current working tree, defeating the purpose of worktree isolation. |
119
236
  | Shell-like tokenizer for argument parsing | LLM handles quoted arguments naturally from conversation context. |
120
237
  | Model attribution via LLM self-identification | Unreliable (hallucination risk). `{{model}}` template variable from `config.getModel()` is accurate. |
121
- | Verbose agent prompts (no length limit) | 5 long prompts exceed output token budget → model falls back to serial. Each prompt must be ≤200 words for parallel. |
238
+ | Verbose agent prompts (no length limit) | 9 long prompts exceed output token budget → model falls back to serial. Each prompt must be ≤200 words for parallel. |
122
239
  | Relaxed parallel instruction ("if you can't fit 5, try 3+2") | Model always takes the fallback. Strict "MUST include all in one response" is required. |
123
240
 
124
241
  ## Token cost analysis
125
242
 
126
243
  For a PR with 15 findings:
127
244
 
128
- | Approach | LLM calls | Notes |
129
- | ------------------------------- | --------- | ------------------------------- |
130
- | Copilot (1 agent) | 1 | Lowest cost, lowest coverage |
131
- | Gemini (2 LLM tasks) | 2 | Good cost, medium coverage |
132
- | Our design (original, N verify) | 21 | 5+15+1 — too expensive |
133
- | Our design (batch verify) | 7 | 5+1+1 — fixed, good coverage |
134
- | Claude /ultrareview | 5-20 | Cloud-hosted, cost on Anthropic |
245
+ | Approach | LLM calls | Notes |
246
+ | --------------------------------------------------- | --------- | ---------------------------------------------------- |
247
+ | Copilot (1 agent) | 1 | Lowest cost, lowest coverage |
248
+ | Gemini (2 LLM tasks) | 2 | Good cost, medium coverage |
249
+ | Our design (5 agents, N verify) | 21 | 5+15+1 — too expensive |
250
+ | Our design (5 agents, batch verify, single reverse) | 7 | 5+1+1 — original design |
251
+ | Our design (9 agents, iterative reverse, current)   | 11-13     | 9+1+(1-3) — roughly 1.6-1.9× the 7-call design, for meaningfully higher recall |
252
+ | Claude /ultrareview | 5-20 | Cloud-hosted, cost on Anthropic |
135
253
 
136
254
  ## Future optimization: Fork Subagent
137
255
 
138
256
  > Dependency: [Fork Subagent proposal](https://github.com/wenshao/codeagents/blob/main/docs/comparison/qwen-code-improvement-report-p0-p1-core.md#2-fork-subagentp0)
139
257
 
140
- **Current problem:** Each of the 7 LLM calls (5 review + 1 verify + 1 reverse) creates a new subagent from scratch. The system prompt (~50K tokens) is sent independently to each, totaling ~350K input tokens with massive redundancy.
258
+ **Current problem:** Each of the 11-13 LLM calls (9 review + 1 verify + 1-3 reverse audit rounds) creates a new subagent from scratch. The system prompt (~50K tokens) is sent independently to each, totaling ~570-680K input tokens with massive redundancy. The cost grew along with the agent count — Fork Subagent matters more under the current 9-agent design than under the original 5-agent design.
141
259
 
142
260
  **Fork Subagent solution:** Instead of creating independent subagents, fork the current conversation. All forks inherit the parent's full context (system prompt, conversation history, Step 1/1.1/1.5 results) and share a prompt cache prefix. The API caches the common prefix once; each fork only pays for its unique delta (~2K per agent).
143
261
 
@@ -145,13 +263,13 @@ For a PR with 15 findings:
145
263
  Current (independent subagents):
146
264
  Agent 1: [50K system] + [2K task] = 52K
147
265
  Agent 2: [50K system] + [2K task] = 52K
148
- ...× 7 agents = ~350K total input tokens
266
+ ...× 11-13 agents = ~570-680K total input tokens
149
267
 
150
268
  With Fork + prompt cache sharing:
151
269
  Cached prefix: [50K system + conversation history] (cached once)
152
270
  Fork 1: [cache hit] + [2K delta] = ~2K effective
153
271
  Fork 2: [cache hit] + [2K delta] = ~2K effective
154
- ...× 7 forks = ~50K cached + ~14K delta = ~65K total
272
+ ...× 11-13 forks = ~50K cached + ~22-26K delta = ~72-76K total
155
273
  ```
156
274
 
157
275
  **Additional benefits for /review:**
@@ -159,7 +277,8 @@ With Fork + prompt cache sharing:
159
277
  - Forked agents inherit Step 3 linter results, PR context, review rules — no need to repeat in each agent prompt
160
278
  - SKILL.md workaround "Do NOT paste the full diff into each agent's prompt" becomes unnecessary — fork already has the context
161
279
  - Verification and reverse audit agents inherit all prior findings naturally
280
+ - Agent 6 personas can fork from a shared diff-loaded base, paying only the persona-framing delta
162
281
 
163
- **Estimated savings:** ~65% token reduction (350K → ~120K) with zero quality impact.
282
+ **Estimated savings:** ~85-90% token reduction (~620K → ~75K) with zero quality impact. The savings ratio is now even more compelling than under the 5-agent design.
164
283
 
165
284
  **Why not implemented now:** Fork Subagent requires changes to the Qwen Code core (`AgentTool`, `forkSubagent.ts`, `CacheSafeParams`). This is a platform-level feature (~400 lines, ~5 days), not a /review-specific change. When available, /review should be updated to use fork instead of independent subagents.