npm - okstra - Versions diffs - 0.45.1 → 0.47.0 - Mend

okstra 0.45.1 → 0.47.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (21)

package/runtime/skills/okstra-convergence/SKILL.md CHANGED

@@ -46,6 +46,7 @@ Configure this in the `convergence` block of `task-manifest.json`. If the block
 | `enabled` | `true` | If `false`, skip the convergence loop and use the existing consensus/divergence method |
 | `maxRounds` | phase-aware: `1` for `requirements-discovery`, `2` otherwise (range 1–3) | Maximum number of re-verification rounds. Discovery's routing/missing-input outputs gain little from a second round; other phases (especially `error-analysis`) keep `2`. Lead resolves the effective value when the manifest omits the key and records it in `config.maxRounds` of the convergence state artifact. |
 | `verificationMode` | `"lightweight"` | `"lightweight"` or `"full-reanalysis"` |
+| `adversarial` | phase-aware: `true` for `requirements-discovery` / `error-analysis`, `false` otherwise | When `true`, Phase 5.5 runs in **adversarial mode** (see §"Adversarial Verification Mode"): verifiers actively try to refute each finding, the burden of proof sits on the claim, and `verificationMode` is forced to `"full-reanalysis"` scoped to the finding's cited evidence. Resolved by `scripts/okstra_ctl/render.py` `_build_convergence_block` and recorded in `config.adversarial` of the convergence state artifact. |
 **Auto-disable rule (BLOCKING).** Convergence requires ≥2 analyser workers to produce a meaningful consensus tally. When the active profile's `Required workers:` block (see `prompts/profiles/*.md`) resolves to fewer than 2 analyser workers — e.g. `release-handoff` (zero analyser workers, lead-only) — the lead MUST treat `convergence.enabled` as `false` for that run regardless of manifest configuration, skip Phases 5.5 and the plan-body verification round, and record `finalState: "converged"` with `totalRounds: 0` and an explanatory note in `config` (e.g. `"autoDisabled": "fewer-than-two-analysers"`). The plan-body round inherits the same rule via its `gating=false` advisory path.
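The phase-aware defaults and the auto-disable rule in this hunk can be sketched as one resolution step. This is a minimal illustration, not the actual `_build_convergence_block` in `scripts/okstra_ctl/render.py`; the function name `resolve_convergence`, its parameters, and the returned dict shape are assumptions made for the example, and range checking of `maxRounds` is left out.

```python
from typing import Any, Dict, Optional

# Phases whose documented default for `adversarial` is true.
ADVERSARIAL_PHASES = {"requirements-discovery", "error-analysis"}


def resolve_convergence(phase: str,
                        manifest_block: Optional[Dict[str, Any]],
                        analyser_count: int) -> Dict[str, Any]:
    """Hypothetical helper: fill in phase-aware defaults for the convergence block."""
    block = dict(manifest_block or {})
    enabled = block.get("enabled", True)
    # Auto-disable rule: fewer than two analyser workers means no meaningful tally.
    if analyser_count < 2:
        enabled = False
    max_rounds = block.get("maxRounds", 1 if phase == "requirements-discovery" else 2)
    adversarial = block.get("adversarial", phase in ADVERSARIAL_PHASES)
    # Adversarial mode forces (scoped) full re-analysis regardless of the manifest.
    mode = "full-reanalysis" if adversarial else block.get("verificationMode", "lightweight")
    return {
        "enabled": enabled,
        "adversarial": adversarial,
        "maxRounds": max_rounds,
        "verificationMode": mode,
    }
```

With an empty manifest block, `error-analysis` would resolve to adversarial mode with two rounds, while a lead-only profile such as `release-handoff` would come back with `enabled` set to `False`.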
@@ -192,6 +193,62 @@ Use the findings as a guide, but reanalyze the original code/data yourself.
 Advantages: High accuracy
 Disadvantages: 2–3 times the cost, increased time
+## Adversarial Verification Mode
+Active only when `config.adversarial == true` (default for `requirements-discovery` and `error-analysis`; see §"Configuration"). When `false`, every rule in this section is inert and the collaborative behaviour documented elsewhere in this skill applies unchanged.
+In adversarial mode the verifier's job inverts: instead of confirming a peer's finding, the verifier **tries to break it**, and the burden of proof sits on the claim — a finding survives only if refutation attempts fail.
+### Scoped full-reanalysis (BLOCKING)
+Adversarial mode forces `verificationMode = "full-reanalysis"`, but the re-analysis is **scoped to the evidence the finding under attack cites** (the file paths / line ranges / log lines in its `originEvidence`), plus the immediately surrounding context. The verifier MUST NOT re-read the whole task brief, instruction-set, or `final-report-template.md`. This keeps the documented "single largest avoidable cost in requirements-discovery and error-analysis" (see §"Reverify prompt: required-reading suppression") bounded while making the refutation real rather than a text-only argument.
+### Adversarial verdict semantics
+The persisted `verdict` enum is unchanged (`agree | disagree | supplement | verification-error`). The prompt-facing labels are adversarial and map down on persistence:
+| Prompt label | Persisted `verdict` | Meaning |
+|---|---|---|
+| SURVIVES | `agree` | Actively tried to refute and failed — the claim withstood the attack. |
+| SURVIVES-WITH-CAVEAT | `supplement` | Holds, but a scope limit / extra condition / precondition was found. |
+| REFUTED | `disagree` | The claim was broken (or failed to prove itself). MUST carry a `disagreeBasis`. |
+Each `disagree` vote records a new field `disagreeBasis`:
+| `disagreeBasis` | Meaning |
+|---|---|
+| `counter-evidence` | The verifier cited contradicting evidence (`file:line` / log line) in `explanation`. A **hard refute**. |
+| `burden-not-met` | The verifier re-inspected the cited evidence and could neither confirm nor refute → the claim failed to prove itself ("when uncertain, lean to rejection"). |
+A `disagree` with `disagreeBasis == null` is a contract violation in adversarial mode — every refutation must state which of the two grounds it rests on. Bare "I disagree" without re-inspection is not allowed.
+### Adversarial classification (replaces the §"Convergence Algorithm" per-round classifier when `adversarial == true`)
+`verification-error` votes are excluded from numerator and denominator exactly as in the collaborative classifier. For each finding `F` in the queue at a round:
+```text
+disagrees    = [v for v in non-error votes if v.verdict == "disagree"]
+hard_refutes = [v for v in disagrees if v.disagreeBasis == "counter-evidence"]
+all_others_disagree = (every non-discoverer non-error vote is "disagree")
+IF len(disagrees) == 0:
+    resolve F as "full-consensus"   (or "partial-consensus" if any SUPPLEMENT/caveat)
+ELIF all_others_disagree:
+    resolve F as "worker-unique"    # only the discoverer still holds it
+ELIF len(hard_refutes) >= 1:
+    # an evidence-backed refute exists and the roster is split → the claim is disputed
+    carry F forward; at the LAST executed round classify it "contested"
+ELIF burden-not-met disagrees are a majority of non-error votes (per the Majority definition in the Convergence Algorithm section):
+    carry F forward; at the LAST executed round classify it "contested"
+ELSE:
+    # a lone weak (burden-not-met) doubt against an otherwise-surviving claim
+    resolve F as "partial-consensus"
+```
+`contested` remains a **final classification only** (per §"Scope and Terminology"): a disputed finding is carried forward through intermediate rounds and labelled `contested` only at the last executed round. For `requirements-discovery` (`effectiveMaxRounds = 1`) the single round IS the last round, so a split-with-hard-refute finding is labelled `contested` in that one round. The final-classifier block of §"Convergence Algorithm" is unchanged; this section only changes how each round's verdicts resolve into queue actions.
+Design intent: one `counter-evidence` refute is enough to deny a claim consensus (it cannot rise above `contested` no matter how many others AGREE), while a single `burden-not-met` doubt does not by itself sink an otherwise-surviving claim — only a majority of burden-not-met doubts does. When every non-discoverer refutes (all_others_disagree), the finding is worker-unique regardless of whether those refutes were counter-evidence or burden-not-met — only the discoverer still holds it. A SUPPLEMENT/caveat with zero disagrees lands partial-consensus rather than full-consensus, because a caveat means the claim does not pass cleanly (this differs from the collaborative classifier, where SUPPLEMENT counts as full agreement).
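The per-round adversarial classifier added above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not code shipped in the package. It assumes the vote list holds only non-discoverer votes, that "majority" means a strict majority of non-error votes (the Majority definition lives in a section not shown in this diff), and it uses a hypothetical `carry-forward` label for the intermediate-round queue action.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vote:
    verdict: str                          # agree | disagree | supplement | verification-error
    disagree_basis: Optional[str] = None  # counter-evidence | burden-not-met | None


def classify_adversarial(votes: List[Vote], is_last_round: bool) -> str:
    """Resolve one finding for one round; `votes` are the non-discoverer votes."""
    # verification-error votes drop out of numerator and denominator.
    live = [v for v in votes if v.verdict != "verification-error"]
    disagrees = [v for v in live if v.verdict == "disagree"]
    hard_refutes = [v for v in disagrees if v.disagree_basis == "counter-evidence"]
    weak_doubts = [v for v in disagrees if v.disagree_basis == "burden-not-met"]
    # `contested` is a final label only; earlier rounds just keep the item queued.
    disputed = "contested" if is_last_round else "carry-forward"

    if not disagrees:
        has_caveat = any(v.verdict == "supplement" for v in live)
        return "partial-consensus" if has_caveat else "full-consensus"
    if len(disagrees) == len(live):
        return "worker-unique"   # only the discoverer still holds the claim
    if hard_refutes:
        return disputed          # one evidence-backed refute blocks consensus
    # ASSUMPTION: strict majority of non-error votes.
    if 2 * len(weak_doubts) > len(live):
        return disputed
    return "partial-consensus"   # a lone weak doubt does not sink the claim
```

The branch order mirrors the pseudocode: the all-disagree check runs before the hard-refute check, so a finding refuted by every other worker is `worker-unique` even when the refutes cite counter-evidence.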
 ## Re-verification Agent Dispatch
 ### Sponsorship Optimization
@@ -282,6 +339,55 @@ For each finding, respond as:
 **Verdict**: ...
 ```
+### Adversarial Re-verification Prompt
+Used instead of the lightweight/full-reanalysis prompt when `config.adversarial == true`. The required anchor headers (§"Required reverify-prompt anchor headers") are identical. The `[Required reading]` clause is suppressed; only the cited-evidence paths of the items under attack are injected (see §"Adversarial Verification Mode" → Scoped full-reanalysis).
+```
+You are <worker-role> performing ADVERSARIAL re-verification for <task-key> (round <N>).
+## Instructions
+Your job is to BREAK each finding below, not to confirm it. For EACH finding,
+open the cited evidence directly and actively search for evidence that the claim
+is wrong, overstated, or unproven. Then respond with exactly one verdict:
+- **REFUTED**: You broke the claim. State the basis:
+  - counter-evidence — you found contradicting evidence (give file:line or log line), OR
+  - burden-not-met — you re-inspected the cited evidence and could neither confirm
+    nor refute it (the claim has not proven itself).
+- **SURVIVES**: You actively tried to refute it and failed — the claim withstood the attack.
+- **SURVIVES-WITH-CAVEAT**: It holds, but a scope limit / extra condition / missing
+  precondition exists (state it).
+The burden of proof is on the claim. If after inspecting the cited evidence you remain
+uncertain, your verdict is REFUTED with basis = burden-not-met.
+Inspect ONLY the evidence each finding cites and its immediate surroundings. Do NOT
+re-read the task brief, instruction-set, or report template.
+## Findings to verify
+### F-001: <one-line summary>
+**Origin**: <worker role>
+**Cited evidence**: <file paths, line numbers, log lines from origin worker>
+### F-002: <one-line summary>
+...
+## Response format
+### F-001
+**Verdict**: REFUTED | SURVIVES | SURVIVES-WITH-CAVEAT
+**Basis** (only if REFUTED): counter-evidence | burden-not-met
+**Explanation**: <2-3 sentences; for counter-evidence include the file:line you found>
+### F-002
+...
+```
+When persisting votes, map SURVIVES→`agree`, SURVIVES-WITH-CAVEAT→`supplement`, REFUTED→`disagree`, and copy the stated Basis into `votes.<worker>.disagreeBasis` (null for non-REFUTED verdicts).
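The label-to-verdict mapping just described could look like the sketch below. The function name `to_persisted_vote` and its error handling are assumptions for illustration; only the mapping itself, the two basis values, and the rule that a REFUTED verdict must state a basis come from the text above.

```python
from typing import Any, Dict, Optional

LABEL_TO_VERDICT = {
    "SURVIVES": "agree",
    "SURVIVES-WITH-CAVEAT": "supplement",
    "REFUTED": "disagree",
}
VALID_BASES = {"counter-evidence", "burden-not-met"}


def to_persisted_vote(label: str, basis: Optional[str], explanation: str) -> Dict[str, Any]:
    """Hypothetical helper: turn a prompt-facing adversarial verdict into a stored vote."""
    key = label.strip().upper()
    if key not in LABEL_TO_VERDICT:
        raise ValueError(f"unknown adversarial verdict label: {label!r}")
    verdict = LABEL_TO_VERDICT[key]
    if verdict == "disagree":
        # A refutation without a stated basis is a contract violation.
        if basis not in VALID_BASES:
            raise ValueError("REFUTED requires basis counter-evidence or burden-not-met")
        disagree_basis = basis
    else:
        disagree_basis = None  # basis is stored only for REFUTED verdicts
    return {"verdict": verdict, "disagreeBasis": disagree_basis, "explanation": explanation}
```

Rejecting a basis-less REFUTED at persistence time is one possible place to enforce the contract; the skill text does not say where the check actually happens.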
 ### Full Re-analysis Re-verification Prompt
 ```
@@ -324,10 +430,11 @@ Save it to `runs/<task-type>/state/convergence-<task-type>-<seq>.json`.
 ```json
 {
-  "schemaVersion": "1.1",
+  "schemaVersion": "1.2",
   "taskKey": "<task-key>",
   "config": {
     "enabled": true,
+    "adversarial": false,
     "maxRounds": 2,
     "effectiveMaxRounds": 2,
     "verificationMode": "lightweight"
@@ -345,7 +452,7 @@ Save it to `runs/<task-type>/state/convergence-<task-type>-<seq>.json`.
         {
           "round": 1,
           "votes": {
-            "codex-worker": { "verdict": "agree", "explanation": "<brief>" },
+            "codex-worker": { "verdict": "agree", "disagreeBasis": null, "explanation": "<brief>" },
             "gemini-worker": { "verdict": "supplement", "explanation": "<brief>" }
           }
         }
@@ -385,11 +492,13 @@ Save it to `runs/<task-type>/state/convergence-<task-type>-<seq>.json`.
 Schema rules:
-- `schemaVersion`: literal string `"1.1"` for new runs. Readers MUST accept `"1.0"` for historical artifacts and treat any missing v1.1 field as `null`.
+- `schemaVersion`: literal string `"1.2"` for all new runs — both adversarial and collaborative. v1.2 adds `config.adversarial` and `votes.<worker>.disagreeBasis`, written as `false` / `null` respectively on collaborative runs. Readers MUST accept `"1.0"` / `"1.1"` / `"1.2"` for historical artifacts and treat any missing field as `null`.
+- `config.adversarial`: boolean. `true` when this run used adversarial verification (default for `requirements-discovery` / `error-analysis`). When `true`, `config.verificationMode` is `"full-reanalysis"` (scoped) and every `disagree` vote carries a non-null `disagreeBasis`.
 - `config.effectiveMaxRounds`: the integer the lead actually used after resolving the phase-aware default (`1` for `requirements-discovery`, `2` otherwise). MUST equal `config.maxRounds` when the manifest explicitly set it.
 - `findings[].ticketIds`: array of ticket keys from Phase 4 grouping (parsed per the Round 0 step 5 rule). MAY be empty when the discovering worker tagged the finding `unknown`.
 - `findings[].rounds[].votes.<worker>.verdict`: enum, one of `agree | disagree | supplement | verification-error`. Lower-case tokens; map upper-case AGREE/DISAGREE/SUPPLEMENT verdicts emitted by workers to their lower-case form before persisting. `verification-error` is reserved for terminal non-result dispatches (§"Worker failure handling in reverify").
-- `findings[].classification`: enum, one of `full-consensus | partial-consensus | worker-unique | contested`. No other value is permitted in v1.1.
+- `findings[].rounds[].votes.<worker>.disagreeBasis`: enum `counter-evidence | burden-not-met | null`. Non-null only when `verdict == "disagree"` AND `config.adversarial == true`; `null` (or absent, treated as null) otherwise. See §"Adversarial Verification Mode".
+- `findings[].classification`: enum, one of `full-consensus | partial-consensus | worker-unique | contested`. No other value is permitted.
 - `roundHistory[].inputQueueSize`: queue size at the start of this round.
 - `roundHistory[].resolvedCount`: number of findings that exited the queue this round (sum of full+partial+worker-unique classifications produced this round).
 - `roundHistory[].carriedForwardCount`: queue size at the END of this round — the single definition. In-round insertions into the queue are forbidden, so this always equals `inputQueueSize - resolvedCount`. The pseudocode's per-item `carriedForwardCount += 1` accumulator is a counting convenience that lands on the same value; persist the post-round queue length, not the loop accumulator, if the two ever diverge.
@@ -397,7 +506,7 @@ Schema rules:
 - `roundHistory[].skippedWorkers[]`: per-worker `{worker, reason}` for workers with no items to verify OR with a non-result dispatch.
 - `round2SkippedReason`: literal enum `queue-empty | max-rounds-1 | all-reverify-non-result | not-skipped`. Always present. Use `"not-skipped"` when Round 2 actually ran. Use `"max-rounds-1"` when `effectiveMaxRounds == 1` (Round 2 was never attempted). Use `"queue-empty"` when Round 1 fully drained the queue. Use `"all-reverify-non-result"` when all Round 1 dispatches terminated as non-result.
 - `finalClassificationCounts`: post-loop counts. Required field with keys `fullConsensus`, `partialConsensus`, `contested`, `workerUnique`.
-- `finalState ∈ {converged, max-rounds-reached, aborted-non-result}`. Assigned by the lead at WHILE-loop exit: `converged` when the queue is empty at the end of any round; `max-rounds-reached` when the loop exits because `roundIndex == effectiveMaxRounds` with the queue still non-empty; `aborted-non-result` when the loop exits via the Worker-failure BREAK (Task 3's "Worker failure handling in reverify" rule 4). `aborted-non-result` is the new v1.1 value.
+- `finalState ∈ {converged, max-rounds-reached, aborted-non-result}`. Assigned by the lead at WHILE-loop exit: `converged` when the queue is empty at the end of any round; `max-rounds-reached` when the loop exits because `roundIndex == effectiveMaxRounds` with the queue still non-empty; `aborted-non-result` when the loop exits via the Worker-failure BREAK (per the "Worker failure handling in reverify" section, rule 4). `aborted-non-result` is the new v1.1 value.
 - `totalRounds`: count of rounds actually executed (not `effectiveMaxRounds`). May be `0` when Round 0 produced no queue items (all findings reached consensus during grouping).
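A reader that honours the v1.2 schema rules for `disagreeBasis` might be sketched as below. This is an assumption-laden illustration, not a validator from the package: the function name `check_disagree_basis` is hypothetical, findings are identified by list index because the diff does not show an id field, and only the `schemaVersion`, `config.adversarial`, and `disagreeBasis` rules are checked.

```python
from typing import Any, Dict, List

ACCEPTED_VERSIONS = {"1.0", "1.1", "1.2"}
VALID_BASES = {"counter-evidence", "burden-not-met"}


def check_disagree_basis(artifact: Dict[str, Any]) -> List[str]:
    """Hypothetical check of the disagreeBasis rules in a convergence state artifact."""
    version = artifact.get("schemaVersion")
    if version not in ACCEPTED_VERSIONS:
        return [f"unsupported schemaVersion: {version!r}"]
    config = artifact.get("config") or {}
    # A missing field is treated as null, so older artifacts read as collaborative.
    adversarial = config.get("adversarial") is True
    problems: List[str] = []
    for index, finding in enumerate(artifact.get("findings", [])):
        for rnd in finding.get("rounds", []):
            for worker, vote in rnd.get("votes", {}).items():
                basis = vote.get("disagreeBasis")
                where = f"findings[{index}] round {rnd.get('round')} {worker}"
                needs_basis = adversarial and vote.get("verdict") == "disagree"
                if needs_basis and basis not in VALID_BASES:
                    problems.append(f"{where}: disagree vote lacks a valid disagreeBasis")
                if not needs_basis and basis is not None:
                    problems.append(f"{where}: disagreeBasis must be null here")
    return problems
```

An empty return list means the artifact passes these particular checks; a v1.1 artifact with no `adversarial` or `disagreeBasis` keys passes unchanged.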
 ## Output

package/runtime/validators/validate-implementation-plan-stages.py CHANGED

@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-"""S1–S8 checks for the Stage Map structure of an approved
+"""S1–S9 checks for the Stage Map structure of an approved
 implementation-planning final-report.md. Run from prepare_task_bundle
 of `implementation` task or standalone."""
@@ -23,6 +23,11 @@ REQUIRED_SUBSECTIONS = (
     "Stage Validation",
 )
+EXIT_CONTRACT_HEADING = re.compile(r"^###\s+Stage Exit Contract\b", re.M)
+# best-effort path token: only slash-containing paths count as files, so
+# endpoints (`/bar`), env vars (`BAZ_MODE`), and extensionless tokens are skipped.
+PATH_TOKEN = re.compile(r"(?:[\w.@-]+/)+[\w.@-]+")
 @dataclass
 class StageMeta:
@@ -35,7 +40,7 @@ class StageMeta:
 @dataclass
 class ValidationError:
-    code: str   # S1..S8
+    code: str   # S1..S9
     stage: int  # 0 = global
     message: str
@@ -85,6 +90,20 @@ def _parse_stage_map(text: str) -> Tuple[List[StageMeta], List[ValidationError]]
     return rows, errors
+def _slice_stage_section(text: str, stage_number: int) -> str:
+    """Return the body of `## 4.5.<n> Stage <n>:` up to the next stage heading."""
+    start_m = re.search(
+        rf"^##\s+4\.5\.{stage_number}\s+Stage\s+{stage_number}\s*:", text, re.M
+    )
+    if not start_m:
+        return ""
+    start = start_m.end()
+    nxt = re.search(
+        rf"^##\s+4\.5\.{stage_number + 1}\s+Stage\s+", text[start:], re.M
+    )
+    return text[start: start + nxt.start()] if nxt else text[start:]
 def _count_effective_steps(section: str) -> int:
     m = re.search(r"^###\s+Stepwise Execution Order\b", section, re.M)
     if not m:
@@ -114,19 +133,13 @@ def _count_effective_steps(section: str) -> int:
 def _check_each_stage_section(text: str, stages: List[StageMeta]) -> List[ValidationError]:
     errs: List[ValidationError] = []
     for s in stages:
-        pattern = rf"^##\s+4\.5\.{s.stage_number}\s+Stage\s+{s.stage_number}\s*:"
-        start_m = re.search(pattern, text, re.M)
-        if not start_m:
+        if not re.search(
+            rf"^##\s+4\.5\.{s.stage_number}\s+Stage\s+{s.stage_number}\s*:", text, re.M
+        ):
             errs.append(ValidationError("S3", s.stage_number,
                 f"stage section '## 4.5.{s.stage_number} Stage {s.stage_number}:' missing"))
             continue
-        # Slice the stage's section body
-        start = start_m.end()
-        nxt = re.search(
-            rf"^##\s+4\.5\.{s.stage_number + 1}\s+Stage\s+",
-            text[start:], re.M,
-        )
-        section = text[start: start + nxt.start()] if nxt else text[start:]
+        section = _slice_stage_section(text, s.stage_number)
         for sub in REQUIRED_SUBSECTIONS:
             if not re.search(rf"^###\s+{re.escape(sub)}\b", section, re.M):
@@ -181,8 +194,42 @@ def _check_depends_on(stages: List[StageMeta]) -> List[ValidationError]:
     return errs
+def _extract_exit_contract_files(section: str) -> set:
+    m = EXIT_CONTRACT_HEADING.search(section)
+    if not m:
+        return set()
+    body = section[m.end():]
+    nxt = re.search(r"^###\s+\w", body, re.M)
+    if nxt:
+        body = body[: nxt.start()]
+    return set(PATH_TOKEN.findall(body))
+def _check_parallel_safety(text: str, stages: List[StageMeta]) -> List[ValidationError]:
+    """S9: two `depends-on (none)` stages must not predict the same file —
+    otherwise two parallel implementation runs would edit it concurrently."""
+    files = {
+        s.stage_number: _extract_exit_contract_files(
+            _slice_stage_section(text, s.stage_number)
+        )
+        for s in stages
+        if not s.depends_on
+    }
+    errs: List[ValidationError] = []
+    nums = sorted(files)
+    for i in range(len(nums)):
+        for j in range(i + 1, len(nums)):
+            a, b = nums[i], nums[j]
+            shared = files[a] & files[b]
+            if shared:
+                errs.append(ValidationError("S9", 0,
+                    f"parallel stages {a} and {b} share predicted file(s): "
+                    f"{', '.join(sorted(shared))}"))
+    return errs
 def collect_validation_errors(text: str) -> List[ValidationError]:
-    """All S1–S8 checks against the report text; empty list means valid.
+    """All S1–S9 checks against the report text; empty list means valid.
     S1 (missing `## 4.5 Stage Map` heading) makes the rest unparseable, so it
     short-circuits. Shared by `main()` (CLI / implementation entry) and the
@@ -198,6 +245,7 @@ def collect_validation_errors(text: str) -> List[ValidationError]:
     if stages:
         errors.extend(_check_each_stage_section(text, stages))
         errors.extend(_check_depends_on(stages))
+        errors.extend(_check_parallel_safety(text, stages))
     return errors
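The S9 overlap check can be tried in isolation with the sketch below. It reuses the `PATH_TOKEN` pattern from the diff verbatim; the helper name `shared_files` and its input (a hypothetical mapping from stage number to an already-extracted Stage Exit Contract body) are assumptions for the example, standing in for `_extract_exit_contract_files` and `_slice_stage_section`.

```python
import re
from itertools import combinations
from typing import Dict, List, Tuple

# Same pattern as in the validator: only slash-containing tokens count as files.
PATH_TOKEN = re.compile(r"(?:[\w.@-]+/)+[\w.@-]+")


def shared_files(contracts: Dict[int, str]) -> List[Tuple[int, int, List[str]]]:
    """Hypothetical helper: report each pair of stages that predicts a common file."""
    files = {number: set(PATH_TOKEN.findall(body)) for number, body in contracts.items()}
    clashes = []
    # Every unordered pair of stages is compared once, in ascending stage order.
    for a, b in combinations(sorted(files), 2):
        shared = files[a] & files[b]
        if shared:
            clashes.append((a, b, sorted(shared)))
    return clashes


contracts = {
    1: "- `src/api/routes.py`\n- endpoint `/bar`, env `BAZ_MODE`",
    2: "- `src/api/routes.py`\n- `src/ui/view.tsx`",
    3: "- `docs/guide.md`",
}
print(shared_files(contracts))  # [(1, 2, ['src/api/routes.py'])]
```

Because the match is best-effort, a slash-less file name such as `README.md` is never counted, while a URL such as `https://example.com/x` contributes the token `example.com/x`, so exit contracts that name files by bare name or cite the same URL may under- or over-report overlaps.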