npm - okstra - Versions diffs - 0.40.0 → 0.42.0 - Mend

okstra 0.40.0 → 0.42.0


Files changed (7)

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "okstra",
-  "version": "0.40.0",
+  "version": "0.42.0",
   "description": "Multi-agent cross-verification orchestrator runtime + Claude Code skills.",
   "license": "MIT",
   "author": "devonshin",

package/runtime/BUILD.json CHANGED

@@ -1,5 +1,5 @@
 {
-  "package": "0.40.0",
-  "builtAt": "2026-06-02T15:22:29.217Z",
+  "package": "0.42.0",
+  "builtAt": "2026-06-03T11:08:34.120Z",
   "repoRoot": "/home/runner/work/okstra/okstra"
 }

package/runtime/agents/workers/claude-worker.md CHANGED

@@ -45,6 +45,7 @@ Unlike the Codex / Gemini workers, you are an in-process Claude subagent — you
 4. Anchor all file operations to the absolute `Project Root` from the lead prompt. Use absolute paths — do NOT rely on inherited cwd. Never use `cd` to change directory.
    - **Executor exception (implementation phase only):** when this worker is dispatched as the `Executor` and the lead prompt provides an `EXECUTOR_WORKTREE_PATH` that differs from the session's inherited cwd, cwd-sensitive Bash commands (`cargo *`, `npm *`, `pnpm *`, `bun *`, `pytest`, `make *`, `go *`, language-toolchain test/build commands) MUST be prefixed with `cd <EXECUTOR_WORKTREE_PATH> && ` in the same Bash invocation — e.g. `cd /Users/.../worktrees/foo && cargo test -p bar`. Do NOT wrap the whole thing in `bash -lc "..."` or `bash -c "..."`; pass the chained command directly to the Bash tool so the leading `cd` token remains visible to the permission layer. The `cd` is scoped to the single Bash subshell and does not mutate the session's shell state, so this does not conflict with the "never use cd" rule above (which prevents the worker from drifting the session cwd across calls).
+   - **Executor coding-conventions preflight (BLOCKING, before your first `Edit` / `Write`):** when dispatched as the `Executor`, you MUST run the coding-conventions preflight defined in the executor sidecar (`prompts/profiles/_implementation-executor.md` → "Pre-implementation context exploration") before writing any code — detect each touched file's language and invoke the project's coding-conventions skill (`coding-preflight` when installed; it routes the matching `languages/<lang>.md` + `clean-code.md` + any hexagonal overlay), then state in one line which conventions apply. Subagents do NOT auto-trigger skills, so this is an explicit step you must perform; if no such skill is reachable in your runtime, degrade per that sidecar section (agnostic principles + project lint/convention files) — never skip the gate.
    - **Verifier QA-gate exception:** verifier roles MAY use the same `cd <WORKTREE> && <cmd>` shape when executing project-declared `qaCommands` (lint / format / typecheck / test) from `project.json`, since those commands are cwd-sensitive by nature. Outside the QA gate, verifiers still read with absolute paths only — do NOT use `cd` for file inspection.
    - **No extra chaining beyond `cd && cmd`:** the permission matcher only allows the exact two-segment shape `cd <PATH> && <single-command>`. Do NOT append additional pipes, semicolons, redirects, or `&&` chains — e.g. `cd ... && cargo test ... 2>&1 | tail -20; echo "exit:$?"` will trigger a permission prompt every dispatch because the trailing `| tail`, `; echo`, and `2>&1` tokens disqualify the prefix match against `Bash(cargo:*)`. Let Claude Code capture the full stdout/stderr and exit code natively — do not post-process with `tail`, `head`, or `echo "exit:$?"`. If output truncation is genuinely needed, run the command first and read the result in a separate tool call.
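
The two-segment rule in the hunk above can be illustrated with a small checker. This is only a sketch: the real permission matcher is not part of this diff, so `is_two_segment_cd_command` and its token list are hypothetical names that approximate the described behaviour (paths containing spaces are not handled).

```python
import re

# Hypothetical approximation of the "exact two-segment shape" rule.
# The actual permission matcher is not shown in this diff.
_DISQUALIFYING_TOKENS = ("|", ";", ">", "<", "&&")


def is_two_segment_cd_command(command: str) -> bool:
    """Return True when command looks like `cd <PATH> && <single-command>`."""
    head, separator, tail = command.partition(" && ")
    if not separator:
        return False  # no `cd ... &&` prefix at all
    if re.fullmatch(r"cd\s+\S+", head.strip()) is None:
        return False  # first segment is not a bare `cd <PATH>`
    tail = tail.strip()
    if not tail:
        return False  # nothing to run after the `cd`
    # Any further chaining, piping, or redirection disqualifies the command.
    return not any(token in tail for token in _DISQUALIFYING_TOKENS)
```

Under these assumptions, `cd /tmp/wt && cargo test -p bar` is accepted, while the same command followed by `2>&1 | tail -20` or wrapped in `bash -lc "..."` is rejected.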

package/runtime/prompts/profiles/_implementation-executor.md CHANGED

@@ -18,10 +18,15 @@ until Phase 5 ends, then drop from active context for Phase 6/7.
 ## Pre-implementation context exploration (executor before first edit)
+- **Coding-conventions preflight (BLOCKING — runs before the first `Edit` / `Write`, and binds the TDD loop below):** load the applicable coding conventions for every language the diff will touch, then state in ONE line which conventions apply (e.g. `Applying TS + hexagonal overlay; domain at src/domains/*/domain/`). Lint/test green is necessary but NOT sufficient — self-mocked tests, interaction-only assertions, and untruthful names all pass a green pipeline; this gate is what keeps them out of the diff.
+  - **Language-specific rules load per situation — never inline them here.** Detect each touched file's language (extension / project manifest) and load the matching reference from the project's coding-conventions skill: `coding-preflight`, when installed, routes `languages/<lang>.md` (mock/spy API, idioms, test framework) + `clean-code.md` + any `architecture/*` overlay. For a ports-and-adapters / NestJS-hex layout (`domain/` + `ports/` + `adapters/`, `*.port.*`), load the hexagonal overlay too. This per-language split is the skill's job — the executor does not carry a multi-language block in context.
+  - **Language-agnostic principles that ALWAYS bind (the TDD loop below MUST satisfy them):** (1) no self-mocking of the SUT — stub/spy only injected collaborators, never the subject's own methods; (2) behavioral assertions on outcomes (return value, state, persisted rows, events, boundary calls) — never `toHaveBeenCalled*` on an internal helper as the only/primary assertion; (3) truthful names — a `get*` / `find*` that writes/inserts, or a name encoding the caller's use-case (`*ForInit`) or hiding a domain rule (`findValid*`), is a defect; (4) single-purpose functions ≤50 effective lines, plain-English readability.
+  - **Graceful degradation (end-user, or codex / gemini executor runtimes where no coding-conventions skill is reachable):** do NOT skip the gate — apply the agnostic principles above plus the project's own `CLAUDE.md` / `CONTRIBUTING` / formatter+lint config, and record `coding-conventions: skill-unavailable → applied <project rules + agnostic principles>` in the final report. Never claim a skill read that did not happen.
 - **Mandatory TDD loop**: BEFORE the first `Edit` or `Write` call, the executor MUST apply a red-green-refactor loop for every code change in this run. This is required; skipping it is a `contract-violated` outcome. This governs HOW each step is executed (failing test first → minimal implementation → refactor); it does not override the approved plan's WHAT/file scope.
   - Order of operations per plan step: (1) write/extend the test that captures the step's acceptance criterion and confirm it fails for the right reason, (2) commit the failing test (`test(<scope>): ...`), (3) implement the minimum change to make it pass, (4) commit the implementation (`feat|fix(<scope>): ...`), (5) refactor without changing behaviour and commit separately if any cleanup is made (`refactor(<scope>): ...`). The failing-then-passing transition between steps (2) and (4) is the `TDD evidence` required by the final report.
   - Doc-only / config-only / pure-rename steps that have no observable runtime behaviour are exempt from the failing-test requirement, but the executor MUST cite the exemption per step in the final report (`TDD exemption: <reason>`).
   - When the touched area has no existing test harness, the executor MUST stand up the minimum harness needed to host one regression test for this run rather than skipping TDD entirely. Record the harness-bootstrap step as an `Out-of-plan edit` if it is not in the plan.
+- **DB / IO / SQL changes require real execution — mock-only is NOT validation evidence:** when this run's diff touches DB/IO/SQL (ORM / query-builder code — sequelize / typeorm / prisma / knex / raw SQL — `*.repository.*`, model/entity files, `migrations/**`, `*.sql`, or any changed query string), a mocked unit test cannot observe the SQL the query builder actually emits — a mocked suite once passed while `count({ col: 'FontFamily.fontFamily' })` threw `Unknown column` on the real DB. The executor MUST run the change against a real (or faithful-replica) datastore — the `db-test` validation step (plan `validation` db step, else `project.json.qaCommands.db-test`), targeting a **local / replica** DB — and cite its exact command + exit code in the final report's `Validation evidence`. If no real DB / `db-test` command is reachable, do NOT claim the change verified: label the DB portion `정적 분석상 …, 미검증(실행 안 함)` in the report, surface it in the routing recommendation, and never downplay the real run as "too heavy". `git push` stays forbidden (universal list); the unverified DB state is carried forward so `final-verification` cannot accept it and `release-handoff` cannot push.
 - re-read the approved plan end-to-end and parse the `## 4.5 Stage Map`. Determine **start stage**:
   - if `--stage <N>` is supplied, use N. Otherwise auto = the lowest stage number whose `depends-on` are all recorded as `status:done` in `runs/<plan-key>/consumers.jsonl` AND that itself has no `status:done` row. Multiple stages may match — two parallel `implementation` runs may pick different ones and proceed concurrently.
   - load every `runs/<plan-key>/carry/stage-<i>.json` for `i ∈ depends-on(start_stage)` and inject them into the executor's working context as "runtime carry-in". For `depends-on (none)` stages, no sidecar load — task-brief only.
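
The automatic start-stage rule described above (lowest stage whose `depends-on` entries are all done and that is not itself done) can be sketched as follows. The `Stage` type and function names are hypothetical, and the sketch assumes the set of done stages has already been extracted from `consumers.jsonl`, whose row format is not shown in this diff.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass(frozen=True)
class Stage:
    """Hypothetical in-memory form of one Stage Map entry."""
    number: int
    depends_on: frozenset = field(default_factory=frozenset)


def eligible_stages(stages: Iterable[Stage], done: set) -> list:
    """Stages that are not done and whose dependencies are all done."""
    return sorted(
        stage.number
        for stage in stages
        if stage.number not in done and stage.depends_on <= done
    )


def pick_start_stage(stages: Iterable[Stage], done: set,
                     requested: Optional[int] = None) -> Optional[int]:
    """An explicit `--stage <N>` wins; otherwise take the lowest eligible stage."""
    if requested is not None:
        return requested
    candidates = eligible_stages(stages, done)
    return candidates[0] if candidates else None
```

`eligible_stages` returns every match rather than only the lowest one, which reflects the note that two parallel runs may pick different stages.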

package/runtime/prompts/profiles/_implementation-verifier.md CHANGED

@@ -30,7 +30,8 @@ Verifier obtains the QA command set from exactly two declared sources, in order
        "lint":      [{ "label": "cargo clippy", "cmd": "cargo clippy --all-targets -- -D warnings", "language": "rust" }],
        "format":    [{ "label": "cargo fmt",    "cmd": "cargo fmt --check",                          "language": "rust" }],
        "typecheck": [{ "label": "tsc",          "cmd": "pnpm exec tsc --noEmit",                     "language": "ts"   }],
-       "test":      [{ "label": "cargo test",   "cmd": "cargo test --workspace --locked",            "language": "rust" }]
+       "test":      [{ "label": "cargo test",   "cmd": "cargo test --workspace --locked",            "language": "rust" }],
+       "db-test":   [{ "label": "db integ",     "cmd": "pnpm test:db",                               "language": "ts"   }]
      }
    }
    ```
@@ -42,7 +43,7 @@ Tier 1 commands run verbatim first. Then every Tier 2 entry runs once. Each comm
 ### Missing-tier handling
-If a tier is empty or absent, verifier records the single line `qa-command not configured: <category>` per missing category (`lint` / `format` / `typecheck` / `test`) in the worker result and proceeds — silent omission is a contract violation. Verifier MUST NOT auto-detect or invent a command in this case; the user/operator must declare it in `project.json.qaCommands` or in the plan.
+If a tier is empty or absent, verifier records the single line `qa-command not configured: <category>` per missing category (`lint` / `format` / `typecheck` / `test`; and `db-test` **only when the diff touches DB/IO/SQL**, where a missing `db-test` is escalated to a blocking finding per the DB real-execution gate below — not a passive note) in the worker result and proceeds — silent omission is a contract violation. Verifier MUST NOT auto-detect or invent a command in this case; the user/operator must declare it in `project.json.qaCommands` or in the plan.
 ### `cmd` field deny-list (Tier 2 validation)
@@ -60,6 +61,30 @@ The worker result MUST contain a `Read-only command log` block listing every com
 The final report keeps both — executor's `Validation evidence` AND each verifier's `Read-only command log` — so reviewers can compare them line-by-line.
+### Static design & test-quality review (gate — runs after the command re-run, before the verdict)
+Re-running commands proves the diff *builds and passes*; it does NOT prove the diff is *well-designed*. Lint/test green is necessary but not sufficient — self-mocked tests, interaction-only assertions, and untruthful names all survive a green pipeline. This gate is the filter for exactly those defects, so the executor's design errors are caught here instead of in post-merge PR review. It is a real gate, not a checklist: it enumerates the full diff and a blocking hit forces `FAIL`.
+- **Scope (no silent sampling).** Enumerate every changed source/test file via `git diff --name-only <base>...HEAD` and review each one. Skipping a changed file silently is a `contract-violated` outcome. If a file's language has no reference and is not covered by the agnostic checks below, record `design-review skipped: <file> (language=<x> no reference)` — never pass it silently.
+- **Load the same conventions the executor used, per language.** For each touched language load the coding-conventions reference (`coding-preflight` `languages/<lang>.md` + `clean-code.md` + the hexagonal overlay when the layout matches); degrade to the agnostic checks below when no skill is reachable. The verifier does NOT inline language rules — it loads them per situation, identical to the executor preflight.
+- **Blocking checks (any hit → verdict `FAIL`, cited `path:line` + rule name, recommended fix recorded — the verifier does NOT apply it):**
+  - **Self-mocking:** a test for `Foo` stubs/spies a method on the `Foo` instance under test (`jest.spyOn(sut, ...)`, `spyOn(FooService.prototype, ...)` in `foo.*.spec.*`, `vi.mocked(sut)` + stub). Mocking injected collaborators is fine.
+  - **Interaction-only assertion:** a test whose only/primary assertion is `toHaveBeenCalled*` / `toHaveBeenCalledTimes` on an internal helper or a non-side-effecting collaborator, with no assertion on the returned value / resulting state / persisted row / emitted event.
+  - **Untruthful name:** a read-named function (`get*` / `find*` / `load*`) that writes/inserts/mutates; an adapter or repository name encoding the caller's use-case (`*ForInit`) or hiding a domain rule (`findValid*` / `findActive*`).
+  - **Hexagonal (only when the overlay is loaded):** business logic inside a port body; an adapter method that is not pure I/O (post-fetch JS filtering on domain state, domain-rule evaluation); a domain object declared outside the `domain/` boundary.
+- **Advisory findings (recorded as recommendations; verdict MAY still PASS):** function >50 effective lines, a single body mixing read+write stages, weak readability, a missing-but-non-critical outcome assertion. These land in the verifier result as `should-fix` / `nit` recommendations, not as a `FAIL`.
+- **Output.** Every finding — blocking or advisory — is a structured item in the verifier's worker result (`path:line`, rule, severity, suggested fix) so it carries into Phase 5.5 convergence and the final report. A blocking hit sets the verifier verdict to `FAIL` with the rule cited, using the same verdict machinery as the Discrepancy rule above. `Claude lead` MUST NOT silently downgrade a cited blocking finding to advisory during synthesis; an override requires a concrete cited reason, exactly as for the Discrepancy rule.
+### DB / IO / SQL change — real-execution gate (mock-only acceptance forbidden)
+A mocked unit test cannot observe the SQL a query builder actually emits — `count({ col: 'FontFamily.fontFamily' })` passes a mocked suite yet throws `Unknown column` on a real database. For this class of change a green mock-only suite is therefore NOT evidence; only a run against a real (or faithful-replica) datastore is. This gate is the verifier's enforcement of that rule.
+- **Trigger.** Fires when `git diff <base>...HEAD` touches DB/IO/SQL: ORM / query-builder code (sequelize / typeorm / prisma / knex / raw SQL), `*.repository.*`, model/entity files, `migrations/**`, `*.sql`, or any changed query string.
+- **Requirement when fired.** The verifier MUST reproduce a real-DB execution: run the `db-test` tier (Tier 1 = plan `validation` db step; else Tier 2 = `project.json.qaCommands.db-test`) against a **local / replica** datastore (same engine + schema — never shared / staging / prod, consistent with the verifier forbidden-actions list) and record its exact command + exit code. A mock, an in-memory shim that does not parse real SQL, or static reasoning does NOT satisfy this.
+- **No `db-test` command available → blocking, not a passive skip.** If neither tier declares a `db-test` command, the verifier records the blocking finding `db-test not configured — DB change unverified (mock-only)` and sets the verdict to `FAIL`; it MUST NOT emit only the passive `qa-command not configured` note and pass. Recommended fix: declare a `db-test` command in `project.json.qaCommands` or the plan's validation set.
+- **Mock-only evidence → unverified.** If the diff's only DB coverage is mocked, the verifier labels the DB portion `정적 분석상 …, 미검증(실행 안 함)` (never `검증됨`), records it as a blocking finding, and sets `FAIL`. Never downplay the real run as "too heavy / static proof suffices".
+- **Surface it at every layer.** The finding is copied verbatim into the verifier result and MUST survive into the final report's `## 1.` and Verdict Card, so the user sees the DB-unverified state continuously — it is the load-bearing reason a downstream `final-verification` cannot reach `accepted` and `release-handoff` cannot push.
 ## All-verifier-failure policy
 If every verifier present in the resolved roster (`Claude verifier`, `Codex verifier`, and `Gemini verifier` when opted in) ends with a non-result terminal status (`timeout`, `error`, `not-run`) — i.e. zero independent verdicts were produced — the run MUST end with status `blocked` and route to a follow-up `error-analysis` run. `Claude lead` MUST NOT substitute its own verdict in place of the missing verifier outputs; synthesis requires at least one independent verifier's verdict. If one or more verifiers fail but at least one returns a verdict, the run proceeds with the surviving verdict(s) and the final report MUST explicitly notate which verifiers were unavailable, with the captured error / timeout evidence per failed verifier.
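
The trigger and tier resolution of the DB real-execution gate added in this file can be sketched in a few lines. This is an illustration under stated assumptions, not the runtime's implementation: `resolve_db_gate` and the status strings are hypothetical, only the path-based part of the trigger is modelled (detecting ORM code or changed query strings would need content inspection), and the `qaCommands` entry shape is taken from the JSON example earlier in this file.

```python
from fnmatch import fnmatchcase

# Path patterns from the gate's trigger list. `*` in fnmatch also matches `/`.
DB_PATH_PATTERNS = ("*.repository.*", "migrations/*", "*/migrations/*", "*.sql")


def touches_db(changed_files):
    """True when any changed path matches a DB-related pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in DB_PATH_PATTERNS
    )


def resolve_db_gate(changed_files, plan_db_cmd, qa_commands):
    """Return (status, commands).

    status is "not-triggered", "run" (commands to execute against a
    local / replica datastore), or "FAIL" (no db-test declared anywhere).
    """
    if not touches_db(changed_files):
        return ("not-triggered", [])
    if plan_db_cmd:  # Tier 1: the plan's validation db step
        return ("run", [plan_db_cmd])
    tier2 = [entry["cmd"] for entry in qa_commands.get("db-test") or []]
    if tier2:  # Tier 2: project.json qaCommands
        return ("run", tier2)
    return ("FAIL", [])  # blocking finding, not a passive note
```

A `"run"` status only means a command exists; the gate is satisfied once that command has actually been executed and its exit code recorded.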

package/runtime/prompts/profiles/final-verification.md CHANGED

@@ -14,6 +14,7 @@
     - delivered artifacts match recorded expected values in `reference-expectations` (config files, deployment manifests, other recorded expected states); when reference-expectations are absent, record it as missing information rather than assuming a match
     - test & validation suite pass status — independently re-run the read-only two-tier command set (Tier 1 = brief/approved-plan `validation`, Tier 2 = `project.json` `qaCommands`) and confirm each passes on the verified head, citing exact command + exit code
     - test correctness — delivered tests actually assert the intended behaviour: no gutted/weakened assertions, no tautological or always-passing tests, no tests exercising only mocks; new behaviour has matching coverage
+    - DB / IO / SQL real-execution evidence — when the diff touches DB/IO/SQL (ORM / query-builder, `*.repository.*`, model / `migrations/**` / `*.sql`, or changed query strings), Validation Evidence MUST cite a real (or faithful-replica) DB execution — the `db-test` command + exit code — not a mock-only suite, because a mocked suite cannot observe the SQL actually emitted (`count({ col: 'FontFamily.fontFamily' })` passed mocks yet threw `Unknown column` on the real DB). A DB-touching change whose only evidence is mocked, or for which no `db-test` ran, is an **Acceptance Blocker** (`major`+): record it, and since `accepted` requires zero blockers the verdict becomes `conditional-accept` / `blocked`. This is the gate that stops an unverified DB change from reaching `release-handoff` and being pushed.
     - no new defects introduced — the diff does not break previously-working behaviour and adds no new bug (logic/off-by-one, null/empty handling, resource leaks, broken error paths)
     - scope conformance — the delivered diff stays within the approved plan's scope; flag out-of-scope edits, unrelated file changes, leftover debug/commented-out code, and unintended deletions
   - Residual-tracked — note as Residual Risk unless severe enough to block:

package/runtime/python/okstra_ctl/qa_commands.py CHANGED

@@ -20,7 +20,11 @@ import re
 from typing import Iterable
 # Category whitelist. Unknown categories are most likely typos, so they are rejected.
-ALLOWED_CATEGORIES: tuple[str, ...] = ("lint", "format", "typecheck", "test")
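+# `db-test` is a dedicated category for tests that run DB/IO/SQL changes against a real DB (or a faithful replica).
+# Mocked unit tests cannot observe the SQL the query builder actually emits, so it is kept
+# separate from `test`. The DB real-execution gate of the implementation verifier / final-verification
+# requires this category (or the db step of the plan validation) when the diff touches the DB.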
+ALLOWED_CATEGORIES: tuple[str, ...] = ("lint", "format", "typecheck", "test", "db-test")
 # Tokens that trigger a mutation or update a lockfile. Each token is detected either in the result of
 # naively splitting the `cmd` string on whitespace or via a partial-match pattern (prefix/suffix sensitive).
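
How the extended whitelist might be applied to a `qaCommands` mapping is sketched below. Only `ALLOWED_CATEGORIES` comes from the diff; the helper names `unknown_categories` and `validate_categories` are hypothetical, since the module's actual validation functions are not shown here.

```python
ALLOWED_CATEGORIES: tuple[str, ...] = ("lint", "format", "typecheck", "test", "db-test")


def unknown_categories(qa_commands: dict) -> list[str]:
    """Return the category keys that are not on the whitelist, sorted."""
    return sorted(set(qa_commands) - set(ALLOWED_CATEGORIES))


def validate_categories(qa_commands: dict) -> None:
    """Reject unknown categories, which are most likely typos."""
    unknown = unknown_categories(qa_commands)
    if unknown:
        raise ValueError(f"unknown qaCommands categories: {', '.join(unknown)}")
```

With the 0.42.0 whitelist a `db-test` key passes this check, whereas a misspelling such as `dbtest` is reported instead of being silently ignored.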