npm - academic-army - Versions diffs - 0.2.0 → 0.3.0 - Mend

academic-army 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (19) hide show

package/README.md +12 -17
package/README.zh-CN.md +12 -17
package/agent-forge.yaml +95 -66
package/mcp-server/__main__.py +2 -0
package/mcp-server/writing_master/__init__.py +3 -0
package/mcp-server/writing_master/tools.py +33 -0
package/metaskills/README.md +0 -3
package/metaskills/README.zh-CN.md +0 -3
package/package.json +4 -3
package/runs/develop-skill.sh +2 -2
package/runs/develop.sh +2 -2
package/skills/academic-army-coding-style/SKILL.md +668 -6
package/skills/academic-army-literate-latex-writing/SKILL.md +61 -0
package/skills/academic-army-literate-latex-writing/agents/openai.yaml +11 -0
package/metaskills/academic-army-repo-scaffold/ENVOLVETASK.md +0 -1
package/metaskills/academic-army-repo-scaffold/METASKILL.md +0 -223
package/metaskills/academic-army-repo-scaffold/envolve.sh +0 -9
package/skills/academic-army-repo-scaffold/SKILL.md +0 -756
package/skills/academic-army-repo-scaffold/agents/openai.yaml +0 -10

package/skills/academic-army-coding-style/SKILL.md CHANGED Viewed

@@ -18,9 +18,9 @@ how to keep the implementation readable, local, low-coupling, testable, and
 consistent with the current framework.
 Do not use this skill to initialize a repository template or recreate a project
-scaffold from an empty directory. Template initialization belongs to a separate
-skill. This skill may add files, modules, tests, harness support, or docs only
-when the current task and current repository need them.
+scaffold from an empty directory. Repository initialization is out of scope. This
+skill may add files, modules, tests, harness support, or docs only when the
+current task and current repository need them.
 ## Operating Boundary
@@ -69,21 +69,133 @@ Before editing, establish a small task-relevant inventory:
 - files and directories relevant to the requested change;
 - expected source, test, harness, export, docs, and dependency surfaces;
 - files that must be left untouched by scope;
+- any explicit allowed-file list or explicit excluded surfaces from the user
+  request, treated as a hard scope fence;
 - existing test and harness layout when relevant;
 - current dirty or untracked files, without reverting user work;
+- **import surface**: when a module will be moved, renamed, or deleted, search
+  every import form that touches it — `from .<module> import`,
+  `from <package>.<module> import`, `from <package> import <module>`, and any
+  indirect imports through package `__init__.py` — so the full consumer set is
+  known before the first edit;
 - accepted constructor fields, identity fields, validation owner, provenance
-  fields, and export surfaces for record-backed helpers.
+  fields, and export surfaces for record-backed helpers;
+- accepted callable signatures, default values, aggregation or identity keys,
+  empty-input behavior, non-mutation expectations, provenance expectations, and
+  regression tests that must not be weakened;
+- task-stated current-state claims — what the task says already exists, is
+  missing, is on or off a path, or has or has not run — reconciled against the
+  actual worktree before acting, with contradictions surfaced as the first
+  finding.
 Treat a suddenly empty or partially missing tree as an integrity blocker. Do not
 reconstruct missing code from memory, plans, reports, or old outputs unless the
 user asks for restoration from a trusted source.
+Treat the task's stated current-state as a claim to verify, not as ground truth.
+Before acting, check each factual premise against the worktree: the paths,
+importability, or artifacts the task asserts as present or absent. When the
+worktree contradicts the task — artifacts the task calls absent already exist, a
+module the task calls off-path is importable, a run the task says never happened
+already produced outputs — surface the contradiction first and re-scope the
+remaining work to the real gap. Do not proceed on a stale premise to manufacture
+work the worktree already satisfies, and do not overwrite or ignore existing
+artifacts to match the task's description. Accepted work is what the real gap
+requires, not what the task's narrative implies.
+When the worktree already satisfies the task's stated objective (the target
+package exists, the files are in place, the shims are written, the structure
+matches the refactor plan), the task is an **already-complete-on-disk** case.
+The correct operational sequence is verification, not re-implementation:
+1. Run the scoped verification the task requests (test suite, import check,
+   whitespace check).
+2. When verification passes, the primary deliverable is updating memory and
+   trajectory files to record the verified state — flipping the phase status,
+   recording exact test counts, shim identities, and import findings, and
+   advancing the stale selection pointer.
+3. Do not re-execute the structural work (re-create files, re-split modules,
+   re-write shims) just because memory says it was never done. Memory is the
+   stale surface; the worktree is the source of truth.
+4. After updating memory, the next-slice pointer should move past this phase
+   to the next real gap. List candidate next phases without selecting one
+   unless the task or active workflow explicitly selects it.
+This pattern is common in phased refactors where work was done in a prior
+session but memory was never updated. The no-op guard in memory files (e.g.
+"N consecutive no-op rounds") exists precisely to detect and break this cycle:
+when memory says "not started" but the worktree says "done," verify and record,
+do not re-execute.
+If the request names an exact allowed-file set, edit only those files. Do not
+touch package entrypoints, export tests, docs, registries, harnesses, artifact
+writers, TODO or memory files, generated outputs, or adjacent modules unless
+they are explicitly in the allowed set. Moving an out-of-scope file into a
+temporary, stash, backup, or memory folder inside the repository still changes
+the repository and does not satisfy the scope fence. Leave unrelated untracked
+or dirty files alone; if they break imports or validation, report the blocker or
+ask for an explicit scope expansion instead of repairing them under the current
+task.
+Do not self-expand the allowed-file list. A developer report, rationale table,
+or "scope updated" note does not change the task scope by itself. Treat scope as
+expanded only when the active user instruction or controlling task definition
+actually supplies the expanded file list or explicitly authorizes the adjacent
+contract work. Until then, out-of-scope fixes remain out of scope even if they
+would make validation pass.
+Guard suites and accepted integration surfaces are validation surfaces, not
+edit surfaces. If a guard imports or exercises an out-of-scope module, that does
+not make the module editable. Preserve it as accepted baseline unless it is in
+the allowed-file list. If stale wiring in that module blocks validation, first
+try to satisfy the contract from the scoped owner; otherwise report a
+validation-scope conflict and the smallest needed scope expansion.
+If an explicit capability ban conflicts with the allowed-file list because a
+live prohibited surface already exists outside that list, do not ignore it and
+do not declare the task stable. Report the scope conflict before editing, or
+remove it only when the user or accepted review explicitly makes that surface a
+cleanup target. The conflict should be visible in the baseline, not discovered
+only after validation fails.
+When a capability category is explicitly excluded, build a short removal or
+absence checklist for every place that category can live in the current repo:
+source files, module entrypoints, package metadata, parser or handler functions,
+tests, docs, examples, and generated or helper files. "Removed" means absent
+from the filesystem and changed-file list, not replaced by a comment-only stub,
+empty placeholder, renamed backup, or disabled test that still sits inside the
+repo. Search the excluded command names, module names, and user-facing phrases
+after cleanup rather than relying on memory of earlier edits.
+Also inventory aliases for the excluded capability: compatibility wrappers,
+re-export modules, alternate file names, embedded handler functions inside
+otherwise legitimate modules, package-script metadata, and tests named after
+the command rather than the original module. Removing only the first obvious
+file is incomplete when another path still exposes the same capability.
+If you must undo your own accidental out-of-scope edits, make the smallest
+surgical removal needed to restore the prior surface. Do not rewrite whole
+entrypoints, export tests, docs, or config files as a cleanup shortcut, because
+that can erase unrelated user work and expand the diff beyond the task.
+For cleanup tasks, make a preservation map before deleting anything: files or
+symbols explicitly requested for removal, files that only need references
+removed, validation targets that must still exist, and accepted work that must
+survive unchanged. Never delete required validation targets, accepted feature
+modules, or their tests just because they depend on a stale excluded import.
+Fix the stale reference at its owner, or report a scope conflict if the owner is
+outside the allowed files.
 ## Task Classification
 Classify the task before editing:
 - **Feature or implementation**: add the smallest clear code path that satisfies
   the requested behavior.
+- **Stabilization or acceptance**: compare the current draft against the
+  accepted contract before deciding no edits are needed. Passing tests alone is
+  not enough; verify signatures, defaults, key derivation, boundary behavior,
+  non-mutation/provenance requirements, docs wording, and the tests that prove
+  those behaviors.
 - **Refactor or cleanup**: move, split, merge, rename, or delete code only to
   improve locality, readability, or testability for the current change.
 - **Harness work**: keep harness code under the relevant `harness/` area; make
@@ -93,10 +205,111 @@ Classify the task before editing:
 - **Method, baseline, metric, or export work**: keep the change near the owning
   extension point and update registration, docs, exports, and tests only when
   those surfaces are in scope.
+- **Bounded bridge or helper work**: keep the callable in its owning module and
+  avoid broadening public package exports, registries, docs, CLIs, harnesses, or
+  artifact surfaces unless the request explicitly includes those surfaces.
+- **Bounded runtime adapter stabilization**: keep discovery, command-plan
+  construction, preflight, staging, and dry-run behavior in the owning adapter
+  modules. Tests should use small local fixtures, temporary paths,
+  monkeypatched imports or subprocesses, and explicit dry-run/preflight checks.
+  Do not add real runtime execution, package exports, CLIs, harness runners,
+  registries, generated paper artifacts, or broader integration behavior unless
+  the request explicitly scopes them. When the request does explicitly scope real
+  execution, treat that scope as a fence: run only the bounded slice it names,
+  keep adjacent surfaces forbidden, and report a missing-input, preflight, or
+  wrong-environment blocker rather than widening the command or carrying the
+  authorization into a neighboring slice.
+  When the task authorizes "any minimal fix needed" to make real execution
+  complete, a config-only change (adding a path entry, an executable
+  override, or a flag value to an existing config field that the adapter
+  already accepts) stays inside the fence. It does not widen the command,
+  add new source code, or create new adapter surfaces. Distinguish this from
+  adding a new CLI flag, registry entry, or module export — those widen the
+  surface and remain forbidden unless explicitly scoped.
+  Before proceeding from dry-run to real execution, verify that the
+  subprocess environment matches the in-process preflight environment.
+  In-process import checks (``command_module_errors``,
+  ``python_module_errors``) resolve imports with the current Python
+  interpreter and ``sys.path``. A subprocess launched via
+  ``subprocess.run(["python", ...])`` may invoke a different Python
+  interpreter (the first ``python`` on ``PATH``) that lacks packages
+  installed only in the project venv. When preflight passes but the
+  subprocess fails with an import error or exit code 1, the first
+  diagnostic is: "is the subprocess running the same Python?" Test the
+  actual subprocess command manually before concluding the adapter logic
+  is wrong. The fix — specifying the venv Python as the plan's
+  ``python_executable`` — is a config-only correction, not a command
+  widening.
+  When transitioning a config from smoke/dry-run to real execution, audit
+  which run spec each runtime plan is attached to. Smoke configs sometimes
+  attach plans to a convenience spec (e.g. the first spec, or the one
+  without deferred loading) because dry-run never consumes the outputs.
+  Real execution may require those plans on a *different* spec — the one
+  whose scheduler delivers the media objects that the generated artifacts
+  are meant to annotate. The diagnostic question is: "does the downstream
+  consumer of these runtime outputs receive them from the same run spec?"
+  If a quality-replacement function reads ``reference_results`` from the
+  current run but the reference objects are only delivered by a different
+  scheduler on a different spec, the bridge silently never fires. Read the
+  consumer (the quality-adjustment or annotation function that reads the
+  runtime results) to confirm it receives
+  the results from the right execution context before concluding the config
+  is correct.
+  When a downstream quality or annotation function contains a conditional
+  early-return (e.g. "only adjust if delivered objects include
+  reference-type media"), verify that the condition actually fires during
+  real execution. A "no reference objects in frame" pass-through is
+  functionally silent — the run succeeds but the measured-quality bridge
+  never activates. After the first real run, inspect the frame outcomes of
+  the relevant frames to confirm the adjusted fields differ from what the
+  same run would produce with ``execute_runtime_plans=false``. If they are
+  identical, the bridge did not fire; diagnose the condition (missing
+  reference objects, frame mapping gap, missing CSV) rather than accepting
+  a green run at face value.
+  When replay steps (controlled by trace length parameters such as
+  ``camera_limit``, ``bandwidth_limit``) and media-object frame ranges
+  (controlled by ``frame_start``/``frame_end`` in options) are configured
+  independently, verify that the replay covers the frames for which
+  runtime-generated artifacts exist. If artifacts target frame *N* but the
+  replay only executes up to frame *N-1*, the artifacts will never be
+  delivered and any quality bridge depending on them will silently skip.
+  The fix is a config adjustment (extend the trace limits), not a runner
+  change.
+  When a task requires re-running an experiment whose pipeline contains
+  non-deterministic components (stochastic training with few epochs, random
+  initialization, GPU non-determinism), the re-run will produce different
+  numeric metrics even though the mechanism is unchanged. A task constraint
+  like "the re-run must reproduce the accepted quality_score=<exact value>"
+  may be physically unsatisfiable if the pipeline variance exceeds the
+  tolerance implied by the constraint. Before accepting such a constraint
+  at face value, check whether the pipeline is deterministic — if it is
+  not, verify that the *mechanism* (parameter flow, quality bridge,
+  computation formula) survived the re-run, and report that exact
+  metric-level reproduction is not guaranteed. Do not try to force
+  determinism through seed pinning or training changes unless the task
+  explicitly scopes them.
+- **Surface cleanup**: remove only the excluded surface and its named wiring.
+  Preserve validation-target modules, accepted features, and unrelated behavior.
+  If deleting one excluded file breaks imports, remove that file's public wiring
+  rather than deleting the importer, the validation target, or the larger
+  subsystem that was supposed to remain.
 - **Validation-only pass**: run the exact requested command from the repository
-  root. If it passes, make no source, test, docs, dependency, export, or TODO
+  root. If it passes, make no source, test, docs, dependency, or export
   changes except removing artifacts created by the run. If it fails, inspect
   the failure and make only the smallest local fix to the accepted contract.
+  Memory and trajectory files (progress records, status snapshots, validation
+  logs, known-gap lists) are **not** source/test/docs/export changes — they
+  are trajectory surfaces that record verified facts. When the task explicitly
+  scopes memory-file updates as part of a validation pass, those updates are
+  the primary deliverable, not a scope violation. Distinguish this from
+  inventing new TODO items or selecting next tasks — recording verified state
+  is maintenance, not feature work.
 - **Framework or docs sync**: update framework docs when module boundaries,
   extension points, harness/test organization, artifact schemas, or repository
   responsibilities change and docs are in scope or are part of the accepted
@@ -148,7 +361,9 @@ Keep responsibilities single:
 Prefer inline or local helpers when logic is used once and remains readable.
 Extract helpers, adapters, registries, factories, contexts, or interfaces only
 when they provide real reuse, isolate a stable boundary, preserve an invariant,
-reduce caller code, or make tests simpler.
+reduce caller code, make tests simpler, or when a refactor plan separates
+conceptually distinct concerns into separately named modules (even if the
+helpers are currently only called from one file).
 When reusing an existing private helper for a broader case, first check whether
 the helper name, parameters, and doc-adjacent wording still describe every
@@ -164,6 +379,174 @@ Reduce global state, hidden path assumptions, implicit side effects, long call
 chains, repeated registration points, and heavy configuration for simple
 experiments.
+### Staged Package Migration (Re-Export Shim)
+When a phased refactor moves a module to a new canonical location, leave the
+old path as a thin re-export shim rather than hunting down and rewriting every
+import site at once. The shim keeps the old module file but replaces its entire
+body with a single import that re-exports the public surface from the new
+location. Every existing `from <old> import X` or `from .<old> import X`
+continues to resolve without change; consumers migrate to the new path in later
+phases at their own pace.
+A proper shim:
+- imports every public name (classes, functions, type aliases) the old module
+  previously exported, so no consumer sees a missing name;
+- also re-exports any name that tests access as a module attribute for
+  monkeypatching (`monkeypatch.setattr(module, "name", ...)`), even if no
+  consumer does `from module import name` — a monkeypatch target that is absent
+  from the shim will fail with `AttributeError`;
+- does not import private helpers unless a consumer is confirmed to depend on
+  them — but when in doubt, re-export the full surface to be safe;
+- has no logic, no side effects, and no new imports beyond the re-export line;
+- lives as a temporary bridge; the shim is deleted once all consumers have
+  migrated to the canonical path.
+Before writing a shim, inventory the full import surface of the module being
+moved: search for `from <package>.<module> import`, `from .<module> import`,
+`from <package> import <module>`, and any indirect imports through package
+`__init__.py`. Confirm every name that any consumer imports is present in the
+shim. After writing the shim, run the full test suite — shim fidelity is the
+central risk, and a missing re-export will surface as `ImportError` in
+consumers.
+After the move, verify import direction: the new canonical module should not
+accidentally import through the shim (creating a circular path), and layered
+packages should only import from layers below them.
+Also verify import **depth**: when code moves from a flat package into a
+subpackage N levels deeper (e.g. flat → `<pkg>/sub1/sub2/`), relative imports
+to sibling top-level packages need N extra dots. From `<pkg>.sub1.sub2`, one
+dot is `<pkg>.sub1.sub2`, two dots is `<pkg>.sub1`, three dots is `<pkg>`.
+Count from the new file's package position up to the common ancestor, then down
+to the target. A smoke test that imports every moved module immediately after
+creation catches depth errors before the test suite runs.
+Also verify **monkeypatch continuity** when the moved code calls a function that
+tests monkeypatch through the old module's namespace. After the split, the new
+canonical module has its own import binding for that function — independent of
+the shim's binding — so a `monkeypatch.setattr(shim, "name", fake)` does not
+affect the new module's call sites. When this pattern exists, make the new
+module look up the function through the shim at call time (e.g. `import
+<old_module> as _shim` and call `_shim.<name>(...)`) rather than via a direct
+bare-name import. This keeps the monkeypatch-able namespace as the single
+point of indirection without changing any logic.
+This pattern applies when the monkeypatch target is a **refabr-internal
+function binding** — a function defined inside the repository that the
+canonical module imports via `from <module> import <func>`. A different
+mechanism applies when the monkeypatch target is a **shared stdlib singleton**
+(e.g. `importlib`, `os`, `sys`). Stdlib modules are singletons cached in
+`sys.modules` — every module that does `import importlib.util` receives a
+reference to the same `importlib` object. When a test patches
+`<shim>.importlib.util.find_spec`, it mutates `sys.modules['importlib'].util.find_spec`,
+and the canonical module's own `importlib.util.find_spec(...)` calls see the
+same mutation because both reference the same shared singleton. In this case
+**no module-level-alias fix is needed** — the shim only needs to re-export
+the stdlib module binding (e.g. `from <canonical> import importlib`) so that
+`<shim>.<stdlib_module>` resolves. Before applying either fix, determine
+which mechanism is in play: test `import sys; <canonical>.<module> is
+sys.modules['<module>']`. If true, it's shared-singleton propagation and needs
+only shim re-export. If false, it's a refabr function binding and needs the
+module-level-alias fix.
+Also verify **transitive import direction** — not just what the new module
+directly imports, but what those imports drag in. A module in a restricted
+layer (e.g. a summarizer that must not touch runtime) may import only result
+dataclasses from a higher layer — semantically clean. But Python executes every
+module-level `import`/`from` statement of the imported module, so if that
+higher-layer module imports runtime adapters, scheduler registries, or config
+registries at module level, the summarizer transitively pulls them all in. This
+is a transitive import violation even though the summarizer's own `import` lines
+look correct. Audit transitive chains by following each import target module's
+own module-level imports recursively until reaching a layer boundary. When a
+transitive violation exists, the narrow fix is to extract the needed result types
+into a dedicated leaf module that only imports from permitted layers (typically
+domain), then have both the higher-layer module and the summarizer import from
+that leaf. When such extraction is out of scope, record the pre-existing
+violation in memory and defer it — do not silently accept it as clean.
+When a phased refactor creates a package directory that shares its name with
+an existing flat module (e.g. creating `metrics/` when `metrics.py` already
+exists in the same directory), Python resolves `import <pkg>.<name>` to the
+package directory, making the flat file unreachable under its original import
+path. Before creating the package, check for a same-name flat module. If one
+exists, absorb its public surface into the package's `__init__.py` and delete
+the flat file from the working tree. The absorbed content must be byte-identical
+to the original — no logic changes. Consumers that imported from the flat
+module (e.g. `from <pkg>.<name> import <symbol>`) will now resolve through the
+package `__init__.py`; verify that every public name the flat module exported
+appears in the `__init__.py`. If the flat module also has consumers that import
+it as a standalone object (`import <pkg>.<name> as <alias>`), those consumers
+resolve to the package module (`<pkg>/<name>/__init__.py`) transparently.
+A `git rm` of the flat file is the cleanest removal — do not leave a `.bak`
+rename, a comment-only stub, or an empty placeholder that would still shadow
+the package in some Python import resolution orders.
+When extending a package's `__init__.py` to re-export a newly added sibling
+module, check for circular imports before adding the re-export line. If the
+new sibling module imports from another sibling (e.g. `<new>.py` does
+`from ..<old_shim> import <name>`), and the `__init__.py` already re-exports
+from that other sibling (or from the shim that resolves to it), adding
+`from .<new> import ...` to `__init__.py` creates a circular chain:
+`__init__` → `.<new>` → `..<old_shim>` → `__init__`. When this is detected,
+leave the `__init__.py` incomplete — the new sibling remains accessible
+through its root re-export shim, not through the package `__init__.py`.
+Document the deferred `__init__.py` re-export as a known gap for a future
+dedicated decoupling pass. Do not reorder imports, add lazy imports, or
+restructure sibling dependencies to force the `__init__.py` re-export — those
+are behavior-sensitive changes that belong to a later phase.
+### Domain-Local Extraction vs. Cross-Module Dedup
+When a task asks to extract shared helpers, distinguish two operations with
+very different risk profiles:
+**Domain-local extraction** — moving private helpers from one module to a
+sibling within the same package. The helpers already live in the module that
+owns them; the extraction just gives each concern its own named file. Call
+sites stay in the same file(s); the only change is an import line switching
+from local `def` to `from .sibling import`. Risk is low because the contract
+does not expand to new callers.
+**Cross-module dedup** — deleting copy-pasted helpers from many modules and
+having them all import from a single canonical source. Risk is high when the
+copies are not byte-identical: subtle differences in whitespace handling,
+signature types, edge-case behavior, or sibling-helper availability mean a
+mechanical unification silently changes validation behavior for some modules.
+Before attempting cross-module dedup, inventory every copy and compare their
+bodies. If copies differ, do not unify them under the current task — defer to
+a dedicated reconciliation pass that can assess each behavioral difference
+individually. The skill's rule "do not relax a shared validator; keep it true
+for every caller" applies across modules too: a validator that changes behavior
+for even one caller is a different validator.
+When the task explicitly scopes only domain-local extraction and names
+cross-module copies as deferred, do not touch those copies even if they appear
+to share the same name. Record them as deferred work and move on.
+Audit path-resolution chains when a value passes through multiple resolution
+layers before reaching its final use site. A path written in a config file
+may be resolved relative to the config directory by a config parser, stored
+as a raw string in an internal record, and later resolved again relative to
+the process CWD by a downstream ``resolve_executable``-style function. When
+the two resolution bases differ (config-dir ≠ CWD), the result can point to
+the wrong file or produce a ``FileNotFoundError``. Before committing a
+relative-path fix, trace the full chain: where is the path first resolved,
+where is it stored, where is it used, and what is the CWD at each stage.
+When code uses both in-process import checks and out-of-process subprocess
+calls, verify that they resolve to the same Python interpreter and package
+set. A preflight that passes in-process (import succeeds with the current
+``sys.path`` and venv packages) does not prove that a subprocess
+``["python", "-m", "some.module"]`` will succeed — ``"python"`` may be a
+different interpreter on ``PATH``. Test the exact subprocess command (with
+the same env vars) before concluding the runtime logic is correct. When the
+subprocess command is built by an adapter from configurable fields, prefer
+to make the python executable configurable and default to the venv Python
+when the task's environment context makes that the correct choice.
 When an interface forces every caller to pass excessive parameters, consider a
 small explicit context or config object. Do not turn that into a framework when
 plain values remain clearer.
@@ -193,6 +576,12 @@ Before writing code, identify the natural owner of the change:
   and tests;
 - a loader or manifest change should mainly touch the input layer and tests.
+When public exports, docs, registries, harness runners, generated artifacts, or
+paper-output surfaces are explicitly excluded, treat existing mentions of the
+new helper in those surfaces as defects to remove, not as consistency surfaces
+to update. Tests should import scoped helpers from their owning module when the
+package entrypoint is out of scope.
 If one feature requires unrelated edits across many areas, treat that as a
 framework-boundary risk. Do the smallest local refactor that brings related code
 together, or report the coupling if a safe local refactor is outside scope.
@@ -201,6 +590,16 @@ Keep code that changes together close. Keep unrelated reasons to change in
 separate modules. Public/shared layers should contain only stable capabilities
 needed by multiple users; special cases should stay near their use sites.
+When a fix patches a latent gap at one call site and sibling call sites share
+the same pattern, they share one change reason and belong in the same change.
+Audit siblings before closing the fix: if a second entry point builds and runs
+the same kind of command, the same resolution, validation, or environment wiring
+applies there too. Fix in-scope siblings together so behavior does not diverge
+silently across entry points. When a sibling is out of scope, record the
+divergence explicitly — which site, which gap, why deferred — instead of leaving
+the inconsistency implicit. Change locality is not a reason to fix one site and
+leave an identical latent gap unflagged at a co-located site.
 ## Harness And Test Discipline
 Harnesses serve paper goals, performance comparison, method screening, module
@@ -274,6 +673,20 @@ Keep harness and test responsibilities separate:
 - export-surface assertions belong in export tests, invalid-state assertions in
   invalid-state tests, and identity/schema assertions in clearly named identity
   or schema tests;
+- for a new private helper with two or more mutually exclusive branches (a
+  relative-versus-bare path resolver, a presence-versus-absence dispatcher, an
+  enabled-versus-disabled switch), pin each branch with its own focused
+  assertion unless the project's test design explicitly forbids that test
+  category. A branch taken only when a condition holds is untested when every
+  fixture takes the other branch; a deterministic test of a pure helper is not
+  the same as a forbidden real-execution or real-data test and should not be
+  skipped on that basis. A branch-pinning assertion only counts as coverage if
+  the repository's normal validation command actually executes it: a doctest
+  embedded in a source module, a test guarded by a non-default flag, or a test
+  file outside the configured test path is not exercised by the requested
+  command and gives no evidence under it. Place the assertion where the suite
+  runs it, or explicitly label it as an unexecuted extra guard; do not count an
+  unrun doctest toward the green-suite total;
 - harness code should not become functional test code;
 - test code should not become paper-performance evaluation.
@@ -302,12 +715,26 @@ surfaces as current, stale, or historical. Edit only stale current surfaces
 needed for the accepted change. If all requested docs are current, report a
 no-op docs sync and the readback/search checks that proved it.
+When a bounded harness or smoke API is added and an in-scope harness README,
+contract note, or config/example doc already describes a broader runner,
+registry, paper experiment, or framework-level entrypoint, include that surface
+in the docs map. Either narrow the current entrypoint/output wording to the
+accepted bounded API, or mark the broader text as planned or historical if that
+is true in the repository. Do not leave full-runner or paper-output semantics as
+the current contract for a module-level helper.
 When one accepted symbol, artifact, metric, method, or helper is documented in
 multiple parallel surfaces, build a small surface map before editing: helper or
 API lists, emitted names, package/module summaries, layout rows, test summaries,
 and absence clauses. Update each stale parallel surface consistently, but do
 not add new public exports, runtime behavior, or future-plan claims just because
 the docs mention the accepted bounded surface.
+For module-level adapter surfaces, map both source/module rosters and test
+rosters. If a scoped test file exists, every current README-style test roster
+that lists comparable sibling tests should include it at the same hierarchy and
+translation level. If a scoped module has no direct test file and the task does
+not request one, do not invent an out-of-scope test solely for roster symmetry;
+report the indirect or absent coverage honestly and keep edits inside scope.
 Bind the surface map to the current selected subject. Neighboring helpers,
 methods, tests, metrics, or earlier accepted features in the same document are
 context, not part of the sync, unless the user explicitly scopes them or the
@@ -423,6 +850,16 @@ it, the current workflow instruction names that handoff, or an existing active
 trajectory already contains that selected task. Otherwise leave a neutral
 waiting state such as "no next developer task is selected."
+Listing **candidate next phases** (e.g. "Next slice options: <phase_a> or
+<phase_b>") is not the same as selecting a next task. Candidates document
+the real gaps visible after the current phase completes — they inform a future
+selection decision but do not make one. A memory file that ends with a
+candidate list and no explicit selection is in a neutral waiting state, not a
+selected handoff. When updating memory after completing a phase, advance the
+stale pointer past the completed phase and list the remaining real gaps as
+candidates; do not promote any candidate to "selected next task" unless the
+user, workflow, or active trajectory explicitly selects it.
 When the accepted source/test task explicitly excluded docs, TODO, exports,
 harnesses, experiments, or generated outputs, preserve that exclusion in the
 trajectory. A later TODO-only pass may record accepted work and verified stale
@@ -546,15 +983,164 @@ problem appears.
 Use the user's requested validation command when provided. Before running, check
 that every explicitly requested target exists; a missing target is a blocker to
 report, not permission to silently narrow the command or create the target.
+Run the command literally from the repository root unless the user gave another
+working directory. Do not substitute a broader suite, omit arguments, add
+environment variables, or rely on an unreported install/import workaround as
+evidence for the requested command. If setup is required before the exact
+command can run, state that setup separately and then rerun and report the exact
+command result.
+When reporting validation, keep each requested command's result separate unless
+you have explicitly deduplicated overlapping tests. Do not invent a combined
+total from overlapping suites, and check that any subtotals you report add up.
+If a command was run twice because suites overlap, say that clearly instead of
+presenting the repeated tests as additional coverage.
+The headline status must match the worst required validation result. Do not say
+"all validations pass", "all clean", "accepted as-is", or "complete" when any
+required command failed, collected errors, or was skipped. If a failure appears
+pre-existing or out of scope, label the command as failed with a pre-existing or
+scope caveat; do not convert it into a passing validation summary.
+If you run extra tests beyond the user's requested validation, label them as
+extra guards and keep them out of the required-command total. Do not describe an
+unrequested test file or suite as part of the scoped validation surface unless
+the task or review explicitly added it.
+For a pure structural move (no logic change), verify byte-level identity between
+the new file and the original content — via diff, checksum, or file-copy
+confirmation. When the task allows exactly one import-path line to differ, diff
+the two files and confirm only that line changed. Verify import direction after
+the move: the new canonical module should not import through the old shim path,
+and layered packages should only import from lower layers. An AST-level import
+scan or a focused grep for `from <wrong_direction> import` catches wrong-way
+dependencies that tests may not exercise.
 For source or test changes, prefer the smallest relevant test target that proves
 the accepted contract, unless the user asked for a broader suite. Use command
 forms that avoid repository cache or bytecode artifacts when the project allows.
+When tests are intentionally removed because they covered an excluded surface,
+the expected test count should decrease. Treat an unchanged or unexpectedly
+higher count as a signal to re-check for stale tests. A full-suite pass is useful
+context but does not replace the exact requested validation result.
+If the exact validation command names files or directories, those paths are
+preserved targets unless the user explicitly says to remove them. A missing
+target is not a successful cleanup; restore the accepted target or report the
+conflict instead of narrowing the command, deleting dependent modules, or
+substituting a broader suite.
+If required validation fails because an out-of-scope dependency, record,
+adapter, metric, parser, or export contract is missing or incompatible, do not
+repair that external contract inside the current task. Report a
+validation-scope conflict with the import or call chain, the smallest candidate
+scope expansion, and any in-scope validation that still passes. Passing the
+requested command after unapproved out-of-scope edits is still not a successful
+completion.
+When a verification or refactor task discovers a **pre-existing architectural
+violation** — an import direction violation, a transitive runtime dependency, a
+god-module leak, a structural inconsistency — that predates the current task
+scope, do not fix it and do not let it block acceptance. Classify it: is it
+pre-existing (flat-module-era code that was never layered) or task-induced (the
+current move/split created a new wrong-way dependency)? Pre-existing violations
+are recording targets: document them in memory as deferred reconciliation work
+with the specific import chain, the layer rule violated, and the suggested
+narrow fix. Task-induced violations are blocking defects that must be fixed
+before acceptance. The test for pre-existing: if reverting the current task's
+changes would leave the violation intact, it is pre-existing. If the violation
+appears only because of the current task's module move or import change, it is
+task-induced.
+For stabilization tasks, inspect whether the tests still prove the named
+contract. A green run is weak evidence when assertions were loosened, regression
+cases were removed, or only happy-path behavior remains. Restore or add focused
+coverage for every user-named behavior before reporting that the surface was
+already stable.
+When a stabilization task says to add or verify focused tests, treat "verify"
+literally. First map the existing test names, fixtures, and assertions to each
+user-named behavior. If the behavior is already covered and the exact requested
+validation passes, do not add duplicate tests just to create activity. Add or
+restore tests only for missing, weakened, or ambiguous coverage, then report the
+coverage map and exact command results.
+When a stabilization pass makes no edits, the coverage map is the main
+deliverable. Do not stop at "no changes needed" plus a validation table; name
+the existing source, test, and docs assertions that prove each user-named
+contract item, and call out any item that is intentionally only indirectly
+covered.
+When the scope asks for non-mutation "where applicable", treat it as a concrete
+coverage item. For each helper or adapter, identify caller-owned inputs that
+could be mutated, such as mappings, sequences, records, dataclasses, configs, and
+fixture objects. Add a focused non-mutation assertion for every mutable or
+caller-owned input, including optional scoped keys or fields, or explicitly
+report why mutation is not applicable.
 After validation, check for generated cache/build/test artifacts created by the
 run and remove only those generated artifacts. Do not clean unrelated dirty or
 untracked user work.
+Before reporting success after source, test, config, or docs edits, run the
+repository's normal formatting or whitespace check when one exists. If no
+project-specific check is known, run the version-control whitespace check, such
+as `git diff --check`, from the repository root. Treat trailing whitespace,
+conflict markers, and blank-line-at-EOF warnings as blockers even when all
+tests pass. This check is especially important after deleting duplicate tests,
+merging adjacent blocks, or editing the end of a file.
+Report whitespace checks exactly: "clean" means no output or findings. If the
+check reports a pre-existing out-of-scope finding, say that it is a separate
+scope conflict or pre-existing blocker; do not call the command clean and do
+not fix the out-of-scope file unless the active task permits it.
+In validation tables, keep the status label and note consistent: use "passed
+with pre-existing warnings" or "blocked by pre-existing findings" as
+appropriate, not "clean", whenever the command emits any warning or finding.
+When the task has an explicit allowed-file set or excluded surface list, run a
+changed-file check before reporting success. The only newly modified,
+untracked, moved, deleted, or created repository paths should be the allowed
+paths plus disposable validation artifacts that were removed. Also search for
+the accepted symbol or capability in excluded surfaces such as package
+entrypoints, export tests, docs, registries, harnesses, memory/TODO files, and
+artifact writers when those surfaces were named as exclusions. Passing tests do
+not override an out-of-scope changed file or stale excluded-surface reference.
+If you report that a file or capability was removed, verify the file path no
+longer exists, no untracked placeholder remains, and no package metadata or
+module entrypoint still references it.
+If you report that a file was rewritten, replaced, or reduced to a narrower
+surface, re-open the final file from disk and verify the exact callable
+signature, key result fields, line-level absence of prohibited imports or
+helpers, and expected line-count direction before reporting success. Do not rely
+on an editor buffer, generated patch text, prior session output, or developer
+report as evidence that the rewrite persisted.
+For importable modules, also verify that the code loaded by the validation
+environment is the same final source file when stale behavior has appeared
+before. Use the language's normal inspection tools when cheap, such as checking
+the loaded file path and public signature. If the test runner imports an older
+surface than the file you think you wrote, stop and resolve that mismatch before
+claiming validation success.
+When the user gives an absence or stale-reference search, run that command
+literally. If no command is given, search every plausible surface for the banned
+capability: tracked and untracked source, tests, docs, package metadata, module
+entrypoints, examples, configs, and harness descriptors. Do not limit the search
+to the files you edited or to the modules you expected to change.
+For docs-only cleanup with an exact absence command and scoped file list, treat
+that command as the primary acceptance gate. Run it before editing to build the
+hit list, and rerun it after every docs pass until it returns zero hits across
+all scoped files. Do not mark unedited files as "unaffected", "already clean",
+or "persisted" from memory; the final absence output and targeted readback must
+prove every scoped file is clean. If the absence command still has hits, report
+the remaining stale surfaces instead of using passing tests as a completion
+claim.
+Exact absence searches are literal, not semantic. A banned phrase still fails
+when it appears in a negative boundary statement, future-plan note, heading, or
+historical sentence. Rephrase scoped docs so the exact tokens disappear while
+preserving the boundary meaning, or report a scope conflict if the stale phrase
+must remain in an out-of-scope historical surface.
+If an exact absence command matches accepted baseline surfaces or unrelated
+scoped documentation, keep the original command as the failing task evidence and
+report the scope conflict. A narrowed or corrected regex can be useful as a
+diagnostic, but it is not a substitute for the user's required validation
+command unless the user or reviewer revises the task. Do not remove accepted
+baseline documentation, exports, tests, or API names only to satisfy an
+overbroad absence pattern.
+Rerun absence searches after the final edit, not before the last deletion. For
+file deletion, verify each requested path individually with a filesystem check;
+glob output, command success text, or a prior deletion attempt is not proof that
+the file is absent.
 For docs-only or TODO-only work, do not run tests unless executable code or
 test files changed accidentally. Re-read edited docs/TODO files and run targeted
 text searches for the accepted names, stale predecessor names, and broad absence
@@ -650,11 +1236,52 @@ For bounded helpers, verify that the implementation:
 - avoids adjacent runtime surfaces such as loaders, registries, exporters,
   harnesses, CLI, experiments, or paper outputs unless explicitly in scope.
+When reviewing a scoped change with an allowed-file list, compare the actual
+changed-file list against that list before reviewing behavior. Flag any package
+export, export-surface test, documentation, registry, harness, artifact writer,
+TODO/memory note, stash/backup directory, generated file, or adjacent module as
+blocking when the request excluded it. Do not accept in-repository stash folders
+as cleanup; out-of-scope work must be removed from the task's worktree state or
+explicitly separated outside the repository by user-approved workflow.
+For cleanup reviews, compare deletions against the preservation map and the
+requested validation command. Deleting a validation target, accepted feature, or
+unrelated subsystem is blocking even when the stale excluded search becomes
+clean. Import errors after removing an excluded file usually mean stale public
+wiring remains; they do not justify deleting the importer or collapsing the
+larger surface.
+When the developer claims an excluded capability was removed, verify with the
+filesystem, full untracked status, and targeted search. Comment-only stubs,
+empty files, disabled tests, parser branches, package metadata entries, module
+entrypoints, and docs that still mention the capability are still present
+surfaces. Treat mismatches between the report and the worktree as blocking
+until the worktree is the source of truth.
+When a parameter, option, command, or public surface is removed, search its
+literal name across implementation, tests, docs, configs, and examples, then
+read the caller chain around remaining hits. A stale signature, forwarded
+keyword, config key, fixture assertion, or README command is blocking unless it
+is explicitly outside the task and reported as a scope conflict.
+For command-like surfaces, search both structural names and user-visible
+actions: module names, compatibility aliases, entrypoint functions, parser or
+handler helpers, package metadata keys, command strings, and tests that import
+or exercise those handlers. A wrapper that only re-exports the removed handler
+is still the removed surface.
+When reviewing a stabilization report that says no changes were needed, still
+compare implementation and tests against the accepted contract. Flag drift in
+call signatures, default arguments, derived keys, empty-input handling,
+non-mutation or provenance behavior, and removed regression tests even if the
+requested commands pass.
 For documentation reviews, compare every newly edited absence clause against
 the implemented-surface list, package/module summaries, layout rows, and test
 summaries. Treat broad "no <category>" wording as a defect when a narrower
 bounded surface in that category is already accepted; ask for the smallest
 wording fix instead of reopening source or tests.
+If a file is meant to be a static config or example, docs should not describe it
+as a runner, entrypoint, command surface, managed artifact generator, or paper
+output path. Replace execution-surface wording with the narrow API or config
+semantics that the task actually accepts.
 Also treat leftover duplicated words, duplicated sentence tails, or malformed
 negative clauses as docs defects when they change or obscure the intended
 scope.
@@ -789,10 +1416,24 @@ After edits, audit:
 - no avoidable global state, hidden paths, repeated registration points, or
   heavy config burden were added;
 - the change stayed local to the natural owner;
+- stabilization preserved accepted callable signatures, defaults, key
+  derivation, boundary behavior, provenance, non-mutation, and regression
+  coverage;
+- cleanup preserved validation targets and accepted work instead of deleting
+  dependent modules, tests, or subsystems to avoid stale references;
 - harness and test responsibilities remain separate;
 - artifact schemas, exporters, docs, and tests agree when any changed;
 - framework docs were updated or confirmed current when in scope;
 - external reused code has compatible license and attribution;
+- explicit allowed-file and excluded-surface scope was preserved, including no
+  in-repository stash, backup, memory, TODO, export, docs, harness, registry, or
+  artifact files created as cleanup side effects;
+- edited files pass the repository whitespace or diff check; test
+  deduplication, file-end edits, and copied docs blocks left no trailing
+  whitespace, conflict markers, or extra blank lines at EOF;
+- excluded capabilities are absent from source files, tests, docs, package
+  metadata, parser or handler branches, module entrypoints, and untracked files,
+  with no explanatory stubs or placeholders left behind;
 - no generated cache/build/test/output/result artifacts were left behind unless
   explicitly requested.
@@ -810,5 +1451,26 @@ Keep the final response concise:
 - validation performed, using readback/search checks for docs-only work;
 - caveats that affect the user's next action.
+A round's real delta is not only source/test edits: a verification or smoke run
+that wrote on-disk artifacts, a status or memory-record flip, and a docs/config
+sync all count. Do not open a report with "no changes needed", "no work
+required", or "already complete and verified on disk" when any such delta
+exists; "no source/behavior changes" is the precise, sanctioned phrasing when
+only non-source surfaces moved, and those edits must still be listed as this
+round's work. Do not sustain a no-op framing by attributing this round's delta
+to a prior round (for example labeling this round's status flip as "updated
+previous turn"), and do not list state already on disk before this round as
+this round's work — mis-attribution in either direction is the same honesty
+defect. Report only the real delta, verified against the prior on-disk state,
+not against memory or a task narrative; a verification run plus a status/memory
+flip is a deliverable, not "no work".
+A pure verification pass — where the task asks to confirm that prior work is
+intact and the developer inspects state, runs tests, and finds no new edits are
+needed — is the exception. In that case the report should describe the checks
+performed and their results, and may state "no new changes made" or "prior work
+confirmed intact on disk." Do not use this exception to relabel an
+implementation pass where work was silently skipped.
 Do not explain skill internals, tool mechanics, or style theory unless the user
 asked for a skill optimizer report.