npm - okstra - Versions diffs - 0.46.0 → 0.48.0 - Mend

okstra 0.46.0 → 0.48.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (22) hide show

package/runtime/skills/okstra-convergence/SKILL.md CHANGED Viewed

@@ -17,8 +17,11 @@ user-invocable: false
   - [Round 1-N: Re-verification Loop (queue-pruned)](#round-1-n-re-verification-loop-queue-pruned)
   - [Convergence Test](#convergence-test)
 - [Verification Mode](#verification-mode)
+- [Adversarial Verification Mode](#adversarial-verification-mode)
 - [Re-verification Agent Dispatch](#re-verification-agent-dispatch)
 - [Convergence State Artifact](#convergence-state-artifact)
+- [Coverage critic pass](#coverage-critic-pass)
+- [Acceptance critic pass (final-verification)](#acceptance-critic-pass-final-verification)
 - [Output](#output)
 - [Convergence Disabled](#convergence-disabled)
 - [Plan-body verification mode (implementation-planning only)](#plan-body-verification-mode-implementation-planning-only)
@@ -46,6 +49,7 @@ Configure this in the `convergence` block of `task-manifest.json`. If the block
 | `enabled` | `true` | If `false`, skip the convergence loop and use the existing consensus/divergence method |
 | `maxRounds` | phase-aware: `1` for `requirements-discovery`, `2` otherwise (range 1–3) | Maximum number of re-verification rounds. Discovery's routing/missing-input outputs gain little from a second round; other phases (especially `error-analysis`) keep `2`. Lead resolves the effective value when the manifest omits the key and records it in `config.maxRounds` of the convergence state artifact. |
 | `verificationMode` | `"lightweight"` | `"lightweight"` or `"full-reanalysis"` |
+| `adversarial` | phase-aware: `true` for `requirements-discovery` / `error-analysis` / `implementation-planning`, `false` otherwise | When `true`, Phase 5.5 runs in **adversarial mode** (see §"Adversarial Verification Mode"): verifiers actively try to refute each finding, the burden of proof sits on the claim, and `verificationMode` is forced to `"full-reanalysis"` scoped to the finding's cited evidence. Resolved by `scripts/okstra_ctl/render.py` `_build_convergence_block` and recorded in `config.adversarial` of the convergence state artifact. |
 **Auto-disable rule (BLOCKING).** Convergence requires ≥2 analyser workers to produce a meaningful consensus tally. When the active profile's `Required workers:` block (see `prompts/profiles/*.md`) resolves to fewer than 2 analyser workers — e.g. `release-handoff` (zero analyser workers, lead-only) — the lead MUST treat `convergence.enabled` as `false` for that run regardless of manifest configuration, skip Phases 5.5 and the plan-body verification round, and record `finalState: "converged"` with `totalRounds: 0` and an explanatory note in `config` (e.g. `"autoDisabled": "fewer-than-two-analysers"`). The plan-body round inherits the same rule via its `gating=false` advisory path.
@@ -192,6 +196,62 @@ Use the findings as a guide, but reanalyze the original code/data yourself.
 Advantages: High accuracy
 Disadvantages: 2–3 times the cost, increased time
+## Adversarial Verification Mode
+Active only when `config.adversarial == true` (default for `requirements-discovery`, `error-analysis`, and `implementation-planning`; see §"Configuration"). When `false`, every rule in this section is inert and the collaborative behaviour documented elsewhere in this skill applies unchanged.
+In adversarial mode the verifier's job inverts: instead of confirming a peer's finding, the verifier **tries to break it**, and the burden of proof sits on the claim — a finding survives only if refutation attempts fail.
+### Scoped full-reanalysis (BLOCKING)
+Adversarial mode forces `verificationMode = "full-reanalysis"`, but the re-analysis is **scoped to the evidence the finding under attack cites** (the file paths / line ranges / log lines in its `originEvidence`), plus the immediately surrounding context. The verifier MUST NOT re-read the whole task brief, instruction-set, or `final-report-template.md`. This keeps the documented "single largest avoidable cost in requirements-discovery, error-analysis, and implementation-planning" (see §"Reverify prompt: required-reading suppression") bounded while making the refutation real rather than a text-only argument.
+### Adversarial verdict semantics
+The persisted `verdict` enum is unchanged (`agree | disagree | supplement | verification-error`). The prompt-facing labels are adversarial and map down on persistence:
+| Prompt label | Persisted `verdict` | Meaning |
+|---|---|---|
+| SURVIVES | `agree` | Actively tried to refute and failed — the claim withstood the attack. |
+| SURVIVES-WITH-CAVEAT | `supplement` | Holds, but a scope limit / extra condition / precondition was found. |
+| REFUTED | `disagree` | The claim was broken (or failed to prove itself). MUST carry a `disagreeBasis`. |
+Each `disagree` vote records a new field `disagreeBasis`:
+| `disagreeBasis` | Meaning |
+|---|---|
+| `counter-evidence` | The verifier cited contradicting evidence (`file:line` / log line) in `explanation`. A **hard refute**. |
+| `burden-not-met` | The verifier re-inspected the cited evidence and could neither confirm nor refute → the claim failed to prove itself ("when uncertain, lean to rejection"). |
+A `disagree` with `disagreeBasis == null` is a contract violation in adversarial mode — every refutation must state which of the two grounds it rests on. Bare "I disagree" without re-inspection is not allowed.
+### Adversarial classification (replaces the §"Convergence Algorithm" per-round classifier when `adversarial == true`)
+`verification-error` votes are excluded from numerator and denominator exactly as in the collaborative classifier. For each finding `F` in the queue at a round:
+```text
+disagrees    = [v for v in non-error votes if v.verdict == "disagree"]
+hard_refutes = [v for v in disagrees if v.disagreeBasis == "counter-evidence"]
+all_others_disagree = (every non-discoverer non-error vote is "disagree")
+IF len(disagrees) == 0:
+    resolve F as "full-consensus"   (or "partial-consensus" if any SUPPLEMENT/caveat)
+ELIF all_others_disagree:
+    resolve F as "worker-unique"    # only the discoverer still holds it
+ELIF len(hard_refutes) >= 1:
+    # an evidence-backed refute exists and the roster is split → the claim is disputed
+    carry F forward; at the LAST executed round classify it "contested"
+ELIF burden-not-met disagrees are a majority of non-error votes (per the Majority definition in the Convergence Algorithm section):
+    carry F forward; at the LAST executed round classify it "contested"
+ELSE:
+    # a lone weak (burden-not-met) doubt against an otherwise-surviving claim
+    resolve F as "partial-consensus"
+```
+`contested` remains a **final classification only** (per §"Scope and Terminology"): a disputed finding is carried forward through intermediate rounds and labelled `contested` only at the last executed round. For `requirements-discovery` (`effectiveMaxRounds = 1`) the single round IS the last round, so a split-with-hard-refute finding is labelled `contested` in that one round. The final-classifier block of §"Convergence Algorithm" is unchanged; this section only changes how each round's verdicts resolve into queue actions.
+Design intent: one `counter-evidence` refute is enough to deny a claim consensus (it cannot rise above `contested` no matter how many others AGREE), while a single `burden-not-met` doubt does not by itself sink an otherwise-surviving claim — only a majority of burden-not-met doubts does. When every non-discoverer refutes (all_others_disagree), the finding is worker-unique regardless of whether those refutes were counter-evidence or burden-not-met — only the discoverer still holds it. A SUPPLEMENT/caveat with zero disagrees lands partial-consensus rather than full-consensus, because a caveat means the claim does not pass cleanly (this differs from the collaborative classifier, where SUPPLEMENT counts as full agreement).
 ## Re-verification Agent Dispatch
 ### Sponsorship Optimization
@@ -242,7 +302,7 @@ Reverify prompts MUST NOT inject the Phase 2 `[Required reading]` clause:
 - **Lightweight mode**: the clause directly contradicts the "Do NOT re-analyze the original source materials" instruction below. Including it forces workers to re-read the entire instruction-set per round per worker (3 workers × 2 rounds × 5+ files in the worst case) for no quality gain.
 - **Full-reanalysis mode**: workers DO need to re-read source materials, but only the analysis-worker file list (no `final-report-template.md`). If lead chooses to inject a reading clause here, it MUST mirror the audience-scoped enumeration in [okstra/SKILL.md](../../SKILL.md) Phase 2 (no template).
-This is the single largest avoidable cost in `requirements-discovery` and `error-analysis` runs. Treat as BLOCKING.
+This is the single largest avoidable cost in `requirements-discovery`, `error-analysis`, and `implementation-planning` runs. Treat as BLOCKING.
 ### Lightweight Re-verification Prompt
@@ -282,6 +342,55 @@ For each finding, respond as:
 **Verdict**: ...
 ```
+### Adversarial Re-verification Prompt
+Used instead of the lightweight/full-reanalysis prompt when `config.adversarial == true`. The required anchor headers (§"Required reverify-prompt anchor headers") are identical. The `[Required reading]` clause is suppressed; only the cited-evidence paths of the items under attack are injected (see §"Adversarial Verification Mode" → Scoped full-reanalysis).
+```
+You are <worker-role> performing ADVERSARIAL re-verification for <task-key> (round <N>).
+## Instructions
+Your job is to BREAK each finding below, not to confirm it. For EACH finding,
+open the cited evidence directly and actively search for evidence that the claim
+is wrong, overstated, or unproven. Then respond with exactly one verdict:
+- **REFUTED**: You broke the claim. State the basis:
+  - counter-evidence — you found contradicting evidence (give file:line or log line), OR
+  - burden-not-met — you re-inspected the cited evidence and could neither confirm
+    nor refute it (the claim has not proven itself).
+- **SURVIVES**: You actively tried to refute it and failed — the claim withstood the attack.
+- **SURVIVES-WITH-CAVEAT**: It holds, but a scope limit / extra condition / missing
+  precondition exists (state it).
+The burden of proof is on the claim. If after inspecting the cited evidence you remain
+uncertain, your verdict is REFUTED with basis = burden-not-met.
+Inspect ONLY the evidence each finding cites and its immediate surroundings. Do NOT
+re-read the task brief, instruction-set, or report template.
+## Findings to verify
+### F-001: <one-line summary>
+**Origin**: <worker role>
+**Cited evidence**: <file paths, line numbers, log lines from origin worker>
+### F-002: <one-line summary>
+...
+## Response format
+### F-001
+**Verdict**: REFUTED | SURVIVES | SURVIVES-WITH-CAVEAT
+**Basis** (only if REFUTED): counter-evidence | burden-not-met
+**Explanation**: <2-3 sentences; for counter-evidence include the file:line you found>
+### F-002
+...
+```
+When persisting votes, map SURVIVES→`agree`, SURVIVES-WITH-CAVEAT→`supplement`, REFUTED→`disagree`, and copy the stated Basis into `votes.<worker>.disagreeBasis` (null for non-REFUTED verdicts).
 ### Full Re-analysis Re-verification Prompt
 ```
@@ -324,10 +433,11 @@ Save it to `runs/<task-type>/state/convergence-<task-type>-<seq>.json`.
 ```json
 {
-  "schemaVersion": "1.1",
+  "schemaVersion": "1.2",
   "taskKey": "<task-key>",
   "config": {
     "enabled": true,
+    "adversarial": false,
     "maxRounds": 2,
     "effectiveMaxRounds": 2,
     "verificationMode": "lightweight"
@@ -345,7 +455,7 @@ Save it to `runs/<task-type>/state/convergence-<task-type>-<seq>.json`.
         {
           "round": 1,
           "votes": {
-            "codex-worker": { "verdict": "agree", "explanation": "<brief>" },
+            "codex-worker": { "verdict": "agree", "disagreeBasis": null, "explanation": "<brief>" },
             "gemini-worker": { "verdict": "supplement", "explanation": "<brief>" }
           }
         }
@@ -385,11 +495,13 @@ Save it to `runs/<task-type>/state/convergence-<task-type>-<seq>.json`.
 Schema rules:
-- `schemaVersion`: literal string `"1.1"` for new runs. Readers MUST accept `"1.0"` for historical artifacts and treat any missing v1.1 field as `null`.
+- `schemaVersion`: literal string `"1.2"` for all new runs — both adversarial and collaborative. v1.2 adds `config.adversarial` and `votes.<worker>.disagreeBasis`, written as `false` / `null` respectively on collaborative runs. Readers MUST accept `"1.0"` / `"1.1"` / `"1.2"` for historical artifacts and treat any missing field as `null`.
+- `config.adversarial`: boolean. `true` when this run used adversarial verification (default for `requirements-discovery` / `error-analysis` / `implementation-planning`). When `true`, `config.verificationMode` is `"full-reanalysis"` (scoped) and every `disagree` vote carries a non-null `disagreeBasis`.
 - `config.effectiveMaxRounds`: the integer the lead actually used after resolving the phase-aware default (`1` for `requirements-discovery`, `2` otherwise). MUST equal `config.maxRounds` when the manifest explicitly set it.
 - `findings[].ticketIds`: array of ticket keys from Phase 4 grouping (parsed per the Round 0 step 5 rule). MAY be empty when the discovering worker tagged the finding `unknown`.
 - `findings[].rounds[].votes.<worker>.verdict`: enum, one of `agree | disagree | supplement | verification-error`. Lower-case tokens; map upper-case AGREE/DISAGREE/SUPPLEMENT verdicts emitted by workers to their lower-case form before persisting. `verification-error` is reserved for terminal non-result dispatches (§"Worker failure handling in reverify").
-- `findings[].classification`: enum, one of `full-consensus | partial-consensus | worker-unique | contested`. No other value is permitted in v1.1.
+- `findings[].rounds[].votes.<worker>.disagreeBasis`: enum `counter-evidence | burden-not-met | null`. Non-null only when `verdict == "disagree"` AND `config.adversarial == true`; `null` (or absent, treated as null) otherwise. See §"Adversarial Verification Mode".
+- `findings[].classification`: enum, one of `full-consensus | partial-consensus | worker-unique | contested`. No other value is permitted.
 - `roundHistory[].inputQueueSize`: queue size at the start of this round.
 - `roundHistory[].resolvedCount`: number of findings that exited the queue this round (sum of full+partial+worker-unique classifications produced this round).
 - `roundHistory[].carriedForwardCount`: queue size at the END of this round — the single definition. In-round insertions into the queue are forbidden, so this always equals `inputQueueSize - resolvedCount`. The pseudocode's per-item `carriedForwardCount += 1` accumulator is a counting convenience that lands on the same value; persist the post-round queue length, not the loop accumulator, if the two ever diverge.
@@ -397,9 +509,69 @@ Schema rules:
 - `roundHistory[].skippedWorkers[]`: per-worker `{worker, reason}` for workers with no items to verify OR with a non-result dispatch.
 - `round2SkippedReason`: literal enum `queue-empty | max-rounds-1 | all-reverify-non-result | not-skipped`. Always present. Use `"not-skipped"` when Round 2 actually ran. Use `"max-rounds-1"` when `effectiveMaxRounds == 1` (Round 2 was never attempted). Use `"queue-empty"` when Round 1 fully drained the queue. Use `"all-reverify-non-result"` when all Round 1 dispatches terminated as non-result.
 - `finalClassificationCounts`: post-loop counts. Required field with keys `fullConsensus`, `partialConsensus`, `contested`, `workerUnique`.
-- `finalState ∈ {converged, max-rounds-reached, aborted-non-result}`. Assigned by the lead at WHILE-loop exit: `converged` when the queue is empty at the end of any round; `max-rounds-reached` when the loop exits because `roundIndex == effectiveMaxRounds` with the queue still non-empty; `aborted-non-result` when the loop exits via the Worker-failure BREAK (Task 3's "Worker failure handling in reverify" rule 4). `aborted-non-result` is the new v1.1 value.
+- `finalState ∈ {converged, max-rounds-reached, aborted-non-result}`. Assigned by the lead at WHILE-loop exit: `converged` when the queue is empty at the end of any round; `max-rounds-reached` when the loop exits because `roundIndex == effectiveMaxRounds` with the queue still non-empty; `aborted-non-result` when the loop exits via the Worker-failure BREAK (per the "Worker failure handling in reverify" section, rule 4). `aborted-non-result` is the new v1.1 value.
 - `totalRounds`: count of rounds actually executed (not `effectiveMaxRounds`). May be `0` when Round 0 produced no queue items (all findings reached consensus during grouping).
+## Coverage critic pass
+Runs only when `convergence.critic.enabled == true` (set by `--critic <provider>` or the okstra-run `critic_pick` step; default off). Applies to the three finding-producing phases (`requirements-discovery`, `error-analysis`, `implementation-planning`); for `final-verification` the critic runs in a different mode — see §"Acceptance critic pass (final-verification)". This pass targets **coverage** (missed findings), distinct from convergence which targets **agreement quality**.
+### When
+After Phase 5.5 finding convergence completes (findings classified) and BEFORE the Phase 6 report-writer dispatch.
+### Dispatch (reused worker)
+Dispatch ONE pass to the `config.critic.provider`'s existing subagent (`claude-worker` / `codex-worker` / `gemini-worker`) with `model = config.critic.modelExecutionValue` — no new agent type. If `config.critic.modelExecutionValue` is null/empty (model could not be resolved), skip the critic pass and record `critic-skipped: model-unresolved` in the convergence state rather than dispatching with no model. Result path: `runs/<task-type>/worker-results/<provider>-critic-<task-type>-<seq>.md`. The critic prompt seeds the consolidated findings and asks ONLY for coverage gaps:
+```
+You are the coverage critic for <task-key>. Below are the findings the workers
+already agreed on. Your ONLY job is to name what is MISSING:
+- files / directories / execution paths nobody inspected,
+- requirements or acceptance points with zero findings,
+- claims raised but never verified.
+For each gap, emit a NEW finding with evidence (file:line or the requirement quote).
+Do NOT restate an existing finding. If nothing is missing, say so explicitly.
+```
+### Gap verification (1 adversarial reverify round)
+Each critic gap enters the verification queue as a finding with `originWorker = "<provider>-critic"` and `source = "critic"`. The lead runs ONE adversarial reverify round (§"Adversarial Verification Mode" classifier) with the Phase 4 analysers (excluding the critic itself) as voters. Only gaps classified `full-consensus` / `partial-consensus` merge into the final report findings; `contested` / `worker-unique` gaps are treated as hallucinations and dropped (recorded in the convergence state, not promoted). If no non-critic analyser is available to vote, the gaps are surfaced as unverified `clarification` items rather than merged, and that fact is recorded.
+### State
+- `convergence.critic` manifest block: `{ enabled, provider, modelExecutionValue }`.
+- Convergence state artifact: critic gaps appear in `findings[]` with `source: "critic"`. Add a `config.critic` summary `{ provider, modelExecutionValue, gapsProposed, gapsMerged }`. `source` and `config.critic` are optional v1.2 fields (readers treat absence as null); no enum changes.
+## Acceptance critic pass (final-verification)
+The `final-verification` phase reuses the SAME reused-worker dispatch as §"Coverage critic pass" (provider + `config.critic.modelExecutionValue` from the `convergence.critic` block; default off; same model-unresolved skip rule). Only the prompt, the verification semantics, and the output sink differ — final-verification's findings are defects/blockers, so the critic acts as an **acceptance devil's advocate** (find reasons NOT to accept), and its candidate blockers are NEVER dropped (that would suppress real defects).
+### Prompt
+```
+You are the acceptance devil's advocate for <task-key>. The delivered work is about
+to be judged for acceptance. Your ONLY job is to find reasons it should NOT be
+accepted — surface candidate acceptance BLOCKERS the verifiers may have missed:
+- requirements / acceptance points with no covering evidence,
+- DB / IO / SQL changes lacking real-execution evidence,
+- regressions or broken error paths,
+- scope / contract violations.
+For each, emit a candidate blocker with a one-line statement, evidence (file:line /
+log / test output), and a severity (critical / major / minor). Do NOT restate an
+existing Acceptance Blocker. If you find none, say so explicitly.
+```
+### Verification — confirm-or-downgrade (BLOCKING)
+Each candidate blocker is verified by the Phase 4 analysers (excluding the critic). Do NOT use the adversarial finding classifier's "uncertain → reject" rule here.
+- **Confirmed** (an analyser reproduces it or cites supporting evidence) → promote to a `## 4 Acceptance Blockers` row (keep severity + recommended follow-up phase).
+- **Not confirmed** (cannot reproduce, or evidence is weak) → **downgrade to a Residual Risk row — never drop it.** Record the escalation trigger so the user can re-judge a high-severity-but-unconfirmed candidate.
+### Verdict impact
+Promoted blockers enter `## 4 Acceptance Blockers`; since `accepted` requires zero blockers, the verdict moves to `conditional-accept` / `blocked` automatically. The existing verdict↔blocker consistency validator (`validators/validate-run.py` `_validate_final_verification_consistency`) enforces this unchanged — no new enum or validator.
+### State
+Critic output lives at `runs/final-verification/worker-results/<provider>-critic-final-verification-<seq>.md`. The convergence state `config.critic` summary (see §"Coverage critic pass") records `mode: "acceptance-devils-advocate"`, `candidatesProposed`, `confirmedBlockers`, `downgradedToResidual` (optional v1.2 fields; readers treat absence as null).
 ## Output
 Information to be passed to Phase 6 after executing this skill:
@@ -491,6 +663,16 @@ Worker non-result handling (`timeout`, `error`, no result file, wrapper `cli-fai
 Plan-body verification only supports **lightweight mode** (defined in §"Verification Mode" above). `full-reanalysis` is not meaningful here because the "original source materials" for a plan item are the worker's own analysis plus the lead-mediated synthesis — there is no independent ground truth to re-read. The manifest's top-level `verificationMode` is ignored for this round; lightweight is always used.
+### Adversarial plan-body posture
+When `config.adversarial == true` (the default for `implementation-planning`; see the top-level §"Configuration" table), the plan-body round runs with an **adversarial posture**. The classification rules and gate arithmetic in §"Round protocol" are UNCHANGED — `majority-disagree` (a *majority* of analysers DISAGREE) remains the only classification that blocks the Approval marker, and `dissent-isolated` still passes the gate. Adversarial mode changes only *how each verifier evaluates an item*:
+- The burden of proof sits on the plan: an item earns `AGREE` only if the verifier actively tried to break it and could not.
+- The verifier MUST open the file paths / symbols / commands the item cites and confirm they exist and are executable as written. This is the one allowed widening of the lightweight "judge from internal consistency and stated commands / paths" rule — confirming the existence of cited paths is not "re-analyzing the original requirements".
+- If a cited path / command / validation signal cannot be confirmed, the verifier responds `DISAGREE(<kind>)` with the applicable breakage kind (a–e); uncertainty resolves toward DISAGREE, not AGREE.
+Plan-body verification stays **lightweight** even under this posture — the `verificationMode = "full-reanalysis"` forcing in §"Adversarial Verification Mode" applies to finding convergence only (see §"Mode constraint"); the adversarial posture here only changes verifier behaviour, not the mode. This raises verification *quality* (active refutation, plan-side burden) without changing the gate *threshold* — a single dissent still does not block approval; a majority is required (deliberate design decision).
 ### Round protocol (single round at default `maxRounds=1`)
 1. Lead parses the report-writer draft and extracts the `P-*` plan items.
@@ -610,6 +792,8 @@ or worker analyses for this round.
 ...
 ```
+When `config.adversarial == true`, the lead prepends the adversarial framing from §"Adversarial plan-body posture" to the `## Instructions` block: the burden of proof is on the plan, the verifier opens and confirms every cited path / command, and an item whose cited references cannot be confirmed is answered `DISAGREE(<kind>)` rather than `AGREE`. The verdict tokens, breakage kinds (a–e), classification, and the majority gate threshold are unchanged. This prepended framing supersedes the template's "Judge solely from plan internal consistency" instruction for the adversarial round.
 The "Reverify prompt: required-reading suppression (BLOCKING)" rule (lightweight mode does NOT inject a `[Required reading]` clause) applies here as well.
 ### Worker non-result handling in plan-body round (BLOCKING)

package/runtime/skills/okstra-run/SKILL.md CHANGED Viewed

@@ -160,6 +160,7 @@ okstra render-bundle \
   --task-type    "<args.task-type>" \
   --task-brief   "<args.task-brief>" \
   --executor     "<args.executor>" \
+  --critic       "<args.critic>" \
   --approved-plan "<args.approved-plan>" \
   --stage        "<args.stage>" \
   --base-ref     "<args.base-ref>" \