npm - okstra - Versions diffs - 0.40.0 → 0.41.0 - Mend

okstra 0.40.0 → 0.41.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5) hide show

package/package.json +1 -1
package/runtime/BUILD.json +2 -2
package/runtime/agents/workers/claude-worker.md +1 -0
package/runtime/prompts/profiles/_implementation-executor.md +4 -0
package/runtime/prompts/profiles/_implementation-verifier.md +14 -0

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "okstra",
-  "version": "0.40.0",
+  "version": "0.41.0",
   "description": "Multi-agent cross-verification orchestrator runtime + Claude Code skills.",
   "license": "MIT",
   "author": "devonshin",

package/runtime/BUILD.json CHANGED Viewed

@@ -1,5 +1,5 @@
 {
-  "package": "0.40.0",
-  "builtAt": "2026-06-02T15:22:29.217Z",
+  "package": "0.41.0",
+  "builtAt": "2026-06-02T15:52:15.071Z",
   "repoRoot": "/home/runner/work/okstra/okstra"
 }

package/runtime/agents/workers/claude-worker.md CHANGED Viewed

@@ -45,6 +45,7 @@ Unlike the Codex / Gemini workers, you are an in-process Claude subagent — you
 4. Anchor all file operations to the absolute `Project Root` from the lead prompt. Use absolute paths — do NOT rely on inherited cwd. Never use `cd` to change directory.
    - **Executor exception (implementation phase only):** when this worker is dispatched as the `Executor` and the lead prompt provides an `EXECUTOR_WORKTREE_PATH` that differs from the session's inherited cwd, cwd-sensitive Bash commands (`cargo *`, `npm *`, `pnpm *`, `bun *`, `pytest`, `make *`, `go *`, language-toolchain test/build commands) MUST be prefixed with `cd <EXECUTOR_WORKTREE_PATH> && ` in the same Bash invocation — e.g. `cd /Users/.../worktrees/foo && cargo test -p bar`. Do NOT wrap the whole thing in `bash -lc "..."` or `bash -c "..."`; pass the chained command directly to the Bash tool so the leading `cd` token remains visible to the permission layer. The `cd` is scoped to the single Bash subshell and does not mutate the session's shell state, so this does not conflict with the "never use cd" rule above (which prevents the worker from drifting the session cwd across calls).
+   - **Executor coding-conventions preflight (BLOCKING, before your first `Edit` / `Write`):** when dispatched as the `Executor`, you MUST run the coding-conventions preflight defined in the executor sidecar (`prompts/profiles/_implementation-executor.md` → "Pre-implementation context exploration") before writing any code — detect each touched file's language and invoke the project's coding-conventions skill (`coding-preflight` when installed; it routes the matching `languages/<lang>.md` + `clean-code.md` + any hexagonal overlay), then state in one line which conventions apply. Subagents do NOT auto-trigger skills, so this is an explicit step you must perform; if no such skill is reachable in your runtime, degrade per that sidecar section (agnostic principles + project lint/convention files) — never skip the gate.
    - **Verifier QA-gate exception:** verifier roles MAY use the same `cd <WORKTREE> && <cmd>` shape when executing project-declared `qaCommands` (lint / format / typecheck / test) from `project.json`, since those commands are cwd-sensitive by nature. Outside the QA gate, verifiers still read with absolute paths only — do NOT use `cd` for file inspection.
    - **No extra chaining beyond `cd && cmd`:** the permission matcher only allows the exact two-segment shape `cd <PATH> && <single-command>`. Do NOT append additional pipes, semicolons, redirects, or `&&` chains — e.g. `cd ... && cargo test ... 2>&1 | tail -20; echo "exit:$?"` will trigger a permission prompt every dispatch because the trailing `| tail`, `; echo`, and `2>&1` tokens disqualify the prefix match against `Bash(cargo:*)`. Let Claude Code capture the full stdout/stderr and exit code natively — do not post-process with `tail`, `head`, or `echo "exit:$?"`. If output truncation is genuinely needed, run the command first and read the result in a separate tool call.

package/runtime/prompts/profiles/_implementation-executor.md CHANGED Viewed

@@ -18,6 +18,10 @@ until Phase 5 ends, then drop from active context for Phase 6/7.
 ## Pre-implementation context exploration (executor before first edit)
+- **Coding-conventions preflight (BLOCKING — runs before the first `Edit` / `Write`, and binds the TDD loop below):** load the applicable coding conventions for every language the diff will touch, then state in ONE line which conventions apply (e.g. `Applying TS + hexagonal overlay; domain at src/domains/*/domain/`). Lint/test green is necessary but NOT sufficient — self-mocked tests, interaction-only assertions, and untruthful names all pass a green pipeline; this gate is what keeps them out of the diff.
+  - **Language-specific rules load per situation — never inline them here.** Detect each touched file's language (extension / project manifest) and load the matching reference from the project's coding-conventions skill: `coding-preflight`, when installed, routes `languages/<lang>.md` (mock/spy API, idioms, test framework) + `clean-code.md` + any `architecture/*` overlay. For a ports-and-adapters / NestJS-hex layout (`domain/` + `ports/` + `adapters/`, `*.port.*`), load the hexagonal overlay too. This per-language split is the skill's job — the executor does not carry a multi-language block in context.
+  - **Language-agnostic principles that ALWAYS bind (the TDD loop below MUST satisfy them):** (1) no self-mocking of the SUT — stub/spy only injected collaborators, never the subject's own methods; (2) behavioral assertions on outcomes (return value, state, persisted rows, events, boundary calls) — never `toHaveBeenCalled*` on an internal helper as the only/primary assertion; (3) truthful names — a `get*` / `find*` that writes/inserts, or a name encoding the caller's use-case (`*ForInit`) or hiding a domain rule (`findValid*`), is a defect; (4) single-purpose functions ≤50 effective lines, plain-English readability.
+  - **Graceful degradation (end-user, or codex / gemini executor runtimes where no coding-conventions skill is reachable):** do NOT skip the gate — apply the agnostic principles above plus the project's own `CLAUDE.md` / `CONTRIBUTING` / formatter+lint config, and record `coding-conventions: skill-unavailable → applied <project rules + agnostic principles>` in the final report. Never claim a skill read that did not happen.
 - **Mandatory TDD loop**: BEFORE the first `Edit` or `Write` call, the executor MUST apply a red-green-refactor loop for every code change in this run. This is required; skipping it is a `contract-violated` outcome. This governs HOW each step is executed (failing test first → minimal implementation → refactor); it does not override the approved plan's WHAT/file scope.
   - Order of operations per plan step: (1) write/extend the test that captures the step's acceptance criterion and confirm it fails for the right reason, (2) commit the failing test (`test(<scope>): ...`), (3) implement the minimum change to make it pass, (4) commit the implementation (`feat|fix(<scope>): ...`), (5) refactor without changing behaviour and commit separately if any cleanup is made (`refactor(<scope>): ...`). The failing-then-passing transition between steps (2) and (4) is the `TDD evidence` required by the final report.
   - Doc-only / config-only / pure-rename steps that have no observable runtime behaviour are exempt from the failing-test requirement, but the executor MUST cite the exemption per step in the final report (`TDD exemption: <reason>`).

package/runtime/prompts/profiles/_implementation-verifier.md CHANGED Viewed

@@ -60,6 +60,20 @@ The worker result MUST contain a `Read-only command log` block listing every com
 The final report keeps both — executor's `Validation evidence` AND each verifier's `Read-only command log` — so reviewers can compare them line-by-line.
+### Static design & test-quality review (gate — runs after the command re-run, before the verdict)
+Re-running commands proves the diff *builds and passes*; it does NOT prove the diff is *well-designed*. Lint/test green is necessary but not sufficient — self-mocked tests, interaction-only assertions, and untruthful names all survive a green pipeline. This gate is the filter for exactly those defects, so the executor's design errors are caught here instead of in post-merge PR review. It is a real gate, not a checklist: it enumerates the full diff and a blocking hit forces `FAIL`.
+- **Scope (no silent sampling).** Enumerate every changed source/test file via `git diff --name-only <base>...HEAD` and review each one. Skipping a changed file silently is a `contract-violated` outcome. If a file's language has no reference and is not covered by the agnostic checks below, record `design-review skipped: <file> (language=<x> no reference)` — never pass it silently.
+- **Load the same conventions the executor used, per language.** For each touched language load the coding-conventions reference (`coding-preflight` `languages/<lang>.md` + `clean-code.md` + the hexagonal overlay when the layout matches); degrade to the agnostic checks below when no skill is reachable. The verifier does NOT inline language rules — it loads them per situation, identical to the executor preflight.
+- **Blocking checks (any hit → verdict `FAIL`, cited `path:line` + rule name, recommended fix recorded — the verifier does NOT apply it):**
+  - **Self-mocking:** a test for `Foo` stubs/spies a method on the `Foo` instance under test (`jest.spyOn(sut, ...)`, `spyOn(FooService.prototype, ...)` in `foo.*.spec.*`, `vi.mocked(sut)` + stub). Mocking injected collaborators is fine.
+  - **Interaction-only assertion:** a test whose only/primary assertion is `toHaveBeenCalled*` / `toHaveBeenCalledTimes` on an internal helper or a non-side-effecting collaborator, with no assertion on the returned value / resulting state / persisted row / emitted event.
+  - **Untruthful name:** a read-named function (`get*` / `find*` / `load*`) that writes/inserts/mutates; an adapter or repository name encoding the caller's use-case (`*ForInit`) or hiding a domain rule (`findValid*` / `findActive*`).
+  - **Hexagonal (only when the overlay is loaded):** business logic inside a port body; an adapter method that is not pure I/O (post-fetch JS filtering on domain state, domain-rule evaluation); a domain object declared outside the `domain/` boundary.
+- **Advisory findings (recorded as recommendations; verdict MAY still PASS):** function >50 effective lines, a single body mixing read+write stages, weak readability, a missing-but-non-critical outcome assertion. These land in the verifier result as `should-fix` / `nit` recommendations, not as a `FAIL`.
+- **Output.** Every finding — blocking or advisory — is a structured item in the verifier's worker result (`path:line`, rule, severity, suggested fix) so it carries into Phase 5.5 convergence and the final report. A blocking hit sets the verifier verdict to `FAIL` with the rule cited, using the same verdict machinery as the Discrepancy rule above. `Claude lead` MUST NOT silently downgrade a cited blocking finding to advisory during synthesis; an override requires a concrete cited reason, exactly as for the Discrepancy rule.
 ## All-verifier-failure policy
 If every verifier present in the resolved roster (`Claude verifier`, `Codex verifier`, and `Gemini verifier` when opted in) ends with a non-result terminal status (`timeout`, `error`, `not-run`) — i.e. zero independent verdicts were produced — the run MUST end with status `blocked` and route to a follow-up `error-analysis` run. `Claude lead` MUST NOT substitute its own verdict in place of the missing verifier outputs; synthesis requires at least one independent verifier's verdict. If one or more verifiers fail but at least one returns a verdict, the run proceeds with the surviving verdict(s) and the final report MUST explicitly notate which verifiers were unavailable, with the captured error / timeout evidence per failed verifier.