academic-army 0.2.1 → 0.3.0

@@ -18,9 +18,9 @@ how to keep the implementation readable, local, low-coupling, testable, and
18
18
  consistent with the current framework.
19
19
 
20
20
  Do not use this skill to initialize a repository template or recreate a project
21
- scaffold from an empty directory. Template initialization belongs to a separate
22
- skill. This skill may add files, modules, tests, harness support, or docs only
23
- when the current task and current repository need them.
21
+ scaffold from an empty directory. Repository initialization is out of scope. This
22
+ skill may add files, modules, tests, harness support, or docs only when the
23
+ current task and current repository need them.
24
24
 
25
25
  ## Operating Boundary
26
26
 
@@ -69,21 +69,133 @@ Before editing, establish a small task-relevant inventory:
69
69
  - files and directories relevant to the requested change;
70
70
  - expected source, test, harness, export, docs, and dependency surfaces;
71
71
  - files that must be left untouched by scope;
72
+ - any explicit allowed-file list or explicit excluded surfaces from the user
73
+ request, treated as a hard scope fence;
72
74
  - existing test and harness layout when relevant;
73
75
  - current dirty or untracked files, without reverting user work;
76
+ - **import surface**: when a module will be moved, renamed, or deleted, search
77
+ every import form that touches it — `from .<module> import`,
78
+ `from <package>.<module> import`, `from <package> import <module>`, and any
79
+ indirect imports through package `__init__.py` — so the full consumer set is
80
+ known before the first edit;
74
81
  - accepted constructor fields, identity fields, validation owner, provenance
75
- fields, and export surfaces for record-backed helpers.
82
+ fields, and export surfaces for record-backed helpers;
83
+ - accepted callable signatures, default values, aggregation or identity keys,
84
+ empty-input behavior, non-mutation expectations, provenance expectations, and
85
+ regression tests that must not be weakened;
86
+ - task-stated current-state claims — what the task says already exists, is
87
+ missing, is on or off a path, or has or has not run — reconciled against the
88
+ actual worktree before acting, with contradictions surfaced as the first
89
+ finding.
76
90
 
77
91
  Treat a suddenly empty or partially missing tree as an integrity blocker. Do not
78
92
  reconstruct missing code from memory, plans, reports, or old outputs unless the
79
93
  user asks for restoration from a trusted source.
80
94
 
95
+ Treat the task's stated current state as a claim to verify, not as ground truth.
96
+ Before acting, check each factual premise against the worktree: the paths,
97
+ importability, or artifacts the task asserts as present or absent. When the
98
+ worktree contradicts the task — artifacts the task calls absent already exist, a
99
+ module the task calls off-path is importable, a run the task says never happened
100
+ already produced outputs — surface the contradiction first and re-scope the
101
+ remaining work to the real gap. Do not proceed on a stale premise to manufacture
102
+ work the worktree already satisfies, and do not overwrite or ignore existing
103
+ artifacts to match the task's description. Accepted work is what the real gap
104
+ requires, not what the task's narrative implies.
105
+
106
+ When the worktree already satisfies the task's stated objective (the target
107
+ package exists, the files are in place, the shims are written, the structure
108
+ matches the refactor plan), the task is an **already-complete-on-disk** case.
109
+ The correct operational sequence is verification, not re-implementation:
110
+
111
+ 1. Run the scoped verification the task requests (test suite, import check,
112
+ whitespace check).
113
+ 2. When verification passes, the primary deliverable is updating memory and
114
+ trajectory files to record the verified state — flipping the phase status,
115
+ recording exact test counts, shim identities, and import findings, and
116
+ advancing the stale selection pointer.
117
+ 3. Do not re-execute the structural work (re-create files, re-split modules,
118
+ re-write shims) just because memory says it was never done. Memory is the
119
+ stale surface; the worktree is the source of truth.
120
+ 4. After updating memory, the next-slice pointer should move past this phase
121
+ to the next real gap. List candidate next phases without selecting one
122
+ unless the task or active workflow explicitly selects it.
123
+
124
+ This pattern is common in phased refactors where work was done in a prior
125
+ session but memory was never updated. The no-op guard in memory files (e.g.
126
+ "N consecutive no-op rounds") exists precisely to detect and break this cycle:
127
+ when memory says "not started" but the worktree says "done," verify and record,
128
+ do not re-execute.
129
+
130
+ If the request names an exact allowed-file set, edit only those files. Do not
131
+ touch package entrypoints, export tests, docs, registries, harnesses, artifact
132
+ writers, TODO or memory files, generated outputs, or adjacent modules unless
133
+ they are explicitly in the allowed set. Moving an out-of-scope file into a
134
+ temporary, stash, backup, or memory folder inside the repository still changes
135
+ the repository and does not satisfy the scope fence. Leave unrelated untracked
136
+ or dirty files alone; if they break imports or validation, report the blocker or
137
+ ask for an explicit scope expansion instead of repairing them under the current
138
+ task.
139
+
140
+ Do not self-expand the allowed-file list. A developer report, rationale table,
141
+ or "scope updated" note does not change the task scope by itself. Treat scope as
142
+ expanded only when the active user instruction or controlling task definition
143
+ actually supplies the expanded file list or explicitly authorizes the adjacent
144
+ contract work. Until then, out-of-scope fixes remain out of scope even if they
145
+ would make validation pass.
146
+
147
+ Guard suites and accepted integration surfaces are validation surfaces, not
148
+ edit surfaces. If a guard imports or exercises an out-of-scope module, that does
149
+ not make the module editable. Preserve it as accepted baseline unless it is in
150
+ the allowed-file list. If stale wiring in that module blocks validation, first
151
+ try to satisfy the contract from the scoped owner; otherwise report a
152
+ validation-scope conflict and the smallest needed scope expansion.
153
+
154
+ If an explicit capability ban conflicts with the allowed-file list because a
155
+ live prohibited surface already exists outside that list, do not ignore it and
156
+ do not declare the task stable. Report the scope conflict before editing, or
157
+ remove it only when the user or accepted review explicitly makes that surface a
158
+ cleanup target. The conflict should be visible in the baseline, not discovered
159
+ only after validation fails.
160
+
161
+ When a capability category is explicitly excluded, build a short removal or
162
+ absence checklist for every place that category can live in the current repo:
163
+ source files, module entrypoints, package metadata, parser or handler functions,
164
+ tests, docs, examples, and generated or helper files. "Removed" means absent
165
+ from the filesystem and changed-file list, not replaced by a comment-only stub,
166
+ empty placeholder, renamed backup, or disabled test that still sits inside the
167
+ repo. Search the excluded command names, module names, and user-facing phrases
168
+ after cleanup rather than relying on memory of earlier edits.
169
+ Also inventory aliases for the excluded capability: compatibility wrappers,
170
+ re-export modules, alternate file names, embedded handler functions inside
171
+ otherwise legitimate modules, package-script metadata, and tests named after
172
+ the command rather than the original module. Removing only the first obvious
173
+ file is incomplete when another path still exposes the same capability.
174
+
175
+ If you must undo your own accidental out-of-scope edits, make the smallest
176
+ surgical removal needed to restore the prior surface. Do not rewrite whole
177
+ entrypoints, export tests, docs, or config files as a cleanup shortcut, because
178
+ that can erase unrelated user work and expand the diff beyond the task.
179
+
180
+ For cleanup tasks, make a preservation map before deleting anything: files or
181
+ symbols explicitly requested for removal, files that only need references
182
+ removed, validation targets that must still exist, and accepted work that must
183
+ survive unchanged. Never delete required validation targets, accepted feature
184
+ modules, or their tests just because they depend on a stale excluded import.
185
+ Fix the stale reference at its owner, or report a scope conflict if the owner is
186
+ outside the allowed files.
187
+
81
188
  ## Task Classification
82
189
 
83
190
  Classify the task before editing:
84
191
 
85
192
  - **Feature or implementation**: add the smallest clear code path that satisfies
86
193
  the requested behavior.
194
+ - **Stabilization or acceptance**: compare the current draft against the
195
+ accepted contract before deciding no edits are needed. Passing tests alone is
196
+ not enough; verify signatures, defaults, key derivation, boundary behavior,
197
+ non-mutation/provenance requirements, docs wording, and the tests that prove
198
+ those behaviors.
87
199
  - **Refactor or cleanup**: move, split, merge, rename, or delete code only to
88
200
  improve locality, readability, or testability for the current change.
89
201
  - **Harness work**: keep harness code under the relevant `harness/` area; make
@@ -93,10 +205,111 @@ Classify the task before editing:
93
205
  - **Method, baseline, metric, or export work**: keep the change near the owning
94
206
  extension point and update registration, docs, exports, and tests only when
95
207
  those surfaces are in scope.
208
+ - **Bounded bridge or helper work**: keep the callable in its owning module and
209
+ avoid broadening public package exports, registries, docs, CLIs, harnesses, or
210
+ artifact surfaces unless the request explicitly includes those surfaces.
211
+ - **Bounded runtime adapter stabilization**: keep discovery, command-plan
212
+ construction, preflight, staging, and dry-run behavior in the owning adapter
213
+ modules. Tests should use small local fixtures, temporary paths,
214
+ monkeypatched imports or subprocesses, and explicit dry-run/preflight checks.
215
+ Do not add real runtime execution, package exports, CLIs, harness runners,
216
+ registries, generated paper artifacts, or broader integration behavior unless
217
+ the request explicitly scopes them. When the request does explicitly scope real
218
+ execution, treat that scope as a fence: run only the bounded slice it names,
219
+ keep adjacent surfaces forbidden, and report a missing-input, preflight, or
220
+ wrong-environment blocker rather than widening the command or carrying the
221
+ authorization into a neighboring slice.
222
+
223
+ When the task authorizes "any minimal fix needed" to make real execution
224
+ complete, a config-only change (adding a path entry, an executable
225
+ override, or a flag value to an existing config field that the adapter
226
+ already accepts) stays inside the fence. It does not widen the command,
227
+ add new source code, or create new adapter surfaces. Distinguish this from
228
+ adding a new CLI flag, registry entry, or module export — those widen the
229
+ surface and remain forbidden unless explicitly scoped.
230
+
231
+ Before proceeding from dry-run to real execution, verify that the
232
+ subprocess environment matches the in-process preflight environment.
233
+ In-process import checks (``command_module_errors``,
234
+ ``python_module_errors``) resolve imports with the current Python
235
+ interpreter and ``sys.path``. A subprocess launched via
236
+ ``subprocess.run(["python", ...])`` may invoke a different Python
237
+ interpreter (the first ``python`` on ``PATH``) that lacks packages
238
+ installed only in the project venv. When preflight passes but the
239
+ subprocess fails with an import error or exit code 1, the first
240
+ diagnostic is: "is the subprocess running the same Python?" Test the
241
+ actual subprocess command manually before concluding the adapter logic
242
+ is wrong. The fix — specifying the venv Python as the plan's
243
+ ``python_executable`` — is a config-only correction, not a command
244
+ widening.
245
+
246
+ When transitioning a config from smoke/dry-run to real execution, audit
247
+ which run spec each runtime plan is attached to. Smoke configs sometimes
248
+ attach plans to a convenience spec (e.g. the first spec, or the one
249
+ without deferred loading) because dry-run never consumes the outputs.
250
+ Real execution may require those plans on a *different* spec — the one
251
+ whose scheduler delivers the media objects that the generated artifacts
252
+ are meant to annotate. The diagnostic question is: "does the downstream
253
+ consumer of these runtime outputs receive them from the same run spec?"
254
+ If a quality-replacement function reads ``reference_results`` from the
255
+ current run but the reference objects are only delivered by a different
256
+ scheduler on a different spec, the bridge silently never fires. Read the
257
+ consumer (the quality-adjustment or annotation function that reads the
258
+ runtime results) to confirm it receives
259
+ the results from the right execution context before concluding the config
260
+ is correct.
261
+
262
+ When a downstream quality or annotation function contains a conditional
263
+ early-return (e.g. "only adjust if delivered objects include
264
+ reference-type media"), verify that the condition actually fires during
265
+ real execution. A "no reference objects in frame" pass-through is
266
+ functionally silent — the run succeeds but the measured-quality bridge
267
+ never activates. After the first real run, inspect the frame outcomes of
268
+ the relevant frames to confirm the adjusted fields differ from what the
269
+ same run would produce with ``execute_runtime_plans=false``. If they are
270
+ identical, the bridge did not fire; diagnose the condition (missing
271
+ reference objects, frame mapping gap, missing CSV) rather than accepting
272
+ a green run at face value.
273
+
274
+ When replay steps (controlled by trace length parameters such as
275
+ ``camera_limit``, ``bandwidth_limit``) and media-object frame ranges
276
+ (controlled by ``frame_start``/``frame_end`` in options) are configured
277
+ independently, verify that the replay covers the frames for which
278
+ runtime-generated artifacts exist. If artifacts target frame *N* but the
279
+ replay only executes up to frame *N-1*, the artifacts will never be
280
+ delivered and any quality bridge depending on them will silently skip.
281
+ The fix is a config adjustment (extend the trace limits), not a runner
282
+ change.
283
+
284
+ When a task requires re-running an experiment whose pipeline contains
285
+ non-deterministic components (stochastic training with few epochs, random
286
+ initialization, GPU non-determinism), the re-run will produce different
287
+ numeric metrics even though the mechanism is unchanged. A task constraint
288
+ like "the re-run must reproduce the accepted quality_score=<exact value>"
289
+ may be physically unsatisfiable if the pipeline variance exceeds the
290
+ tolerance implied by the constraint. Before accepting such a constraint
291
+ at face value, check whether the pipeline is deterministic — if it is
292
+ not, verify that the *mechanism* (parameter flow, quality bridge,
293
+ computation formula) survived the re-run, and report that exact
294
+ metric-level reproduction is not guaranteed. Do not try to force
295
+ determinism through seed pinning or training changes unless the task
296
+ explicitly scopes them.
297
+ - **Surface cleanup**: remove only the excluded surface and its named wiring.
298
+ Preserve validation-target modules, accepted features, and unrelated behavior.
299
+ If deleting one excluded file breaks imports, remove that file's public wiring
300
+ rather than deleting the importer, the validation target, or the larger
301
+ subsystem that was supposed to remain.
96
302
  - **Validation-only pass**: run the exact requested command from the repository
97
- root. If it passes, make no source, test, docs, dependency, export, or TODO
303
+ root. If it passes, make no source, test, docs, dependency, or export
98
304
  changes except removing artifacts created by the run. If it fails, inspect
99
305
  the failure and make only the smallest local fix to the accepted contract.
306
+ Memory and trajectory files (progress records, status snapshots, validation
307
+ logs, known-gap lists) are **not** source/test/docs/export changes — they
308
+ are trajectory surfaces that record verified facts. When the task explicitly
309
+ scopes memory-file updates as part of a validation pass, those updates are
310
+ the primary deliverable, not a scope violation. Distinguish this from
311
+ inventing new TODO items or selecting next tasks — recording verified state
312
+ is maintenance, not feature work.
100
313
  - **Framework or docs sync**: update framework docs when module boundaries,
101
314
  extension points, harness/test organization, artifact schemas, or repository
102
315
  responsibilities change and docs are in scope or are part of the accepted
@@ -148,7 +361,9 @@ Keep responsibilities single:
148
361
  Prefer inline or local helpers when logic is used once and remains readable.
149
362
  Extract helpers, adapters, registries, factories, contexts, or interfaces only
150
363
  when they provide real reuse, isolate a stable boundary, preserve an invariant,
151
- reduce caller code, or make tests simpler.
364
+ reduce caller code, make tests simpler, or when a refactor plan separates
365
+ conceptually distinct concerns into separately named modules (even if the
366
+ helpers are currently only called from one file).
152
367
 
153
368
  When reusing an existing private helper for a broader case, first check whether
154
369
  the helper name, parameters, and doc-adjacent wording still describe every
@@ -164,6 +379,174 @@ Reduce global state, hidden path assumptions, implicit side effects, long call
164
379
  chains, repeated registration points, and heavy configuration for simple
165
380
  experiments.
166
381
 
382
+ ### Staged Package Migration (Re-Export Shim)
383
+
384
+ When a phased refactor moves a module to a new canonical location, leave the
385
+ old path as a thin re-export shim rather than hunting down and rewriting every
386
+ import site at once. The shim keeps the old module file but replaces its entire
387
+ body with a single import that re-exports the public surface from the new
388
+ location. Every existing `from <old> import X` or `from .<old> import X`
389
+ continues to resolve without change; consumers migrate to the new path in later
390
+ phases at their own pace.
391
+
392
+ A proper shim:
393
+
394
+ - imports every public name (classes, functions, type aliases) the old module
395
+ previously exported, so no consumer sees a missing name;
396
+ - also re-exports any name that tests access as a module attribute for
397
+ monkeypatching (`monkeypatch.setattr(module, "name", ...)`), even if no
398
+ consumer does `from module import name` — a monkeypatch target that is absent
399
+ from the shim will fail with `AttributeError`;
400
+ - does not import private helpers unless a consumer is confirmed to depend on
401
+ them — but when in doubt, re-export the full surface to be safe;
402
+ - has no logic, no side effects, and no new imports beyond the re-export line;
403
+ - lives as a temporary bridge; the shim is deleted once all consumers have
404
+ migrated to the canonical path.
405
+
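The shim properties listed above can be illustrated with a small self-contained sketch. It assumes a hypothetical package `demo_pkg` whose `records.py` has moved to `demo_pkg/core/records.py`; every name here is illustrative and not taken from the package itself. The script builds the layout in a temporary directory so it can run on its own.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout: demo_pkg/records.py moved to demo_pkg/core/records.py.
root = Path(tempfile.mkdtemp())
core = root / "demo_pkg" / "core"
core.mkdir(parents=True)
(root / "demo_pkg" / "__init__.py").write_text("")
(core / "__init__.py").write_text("")

# Canonical module: the real definitions now live here.
(core / "records.py").write_text(
    "class Record:\n"
    "    pass\n"
    "\n"
    "def load_record():\n"
    "    return Record()\n"
)

# Old path: the whole body is one re-export statement, with no logic.
(root / "demo_pkg" / "records.py").write_text(
    "from .core.records import Record, load_record  # noqa: F401\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()
old = importlib.import_module("demo_pkg.records")
new = importlib.import_module("demo_pkg.core.records")

# Both paths expose the same objects, so existing imports keep resolving.
same_class = old.Record is new.Record
same_function = old.load_record is new.load_record
print(same_class, same_function)
```

Checking identity with `is`, rather than comparing names only, confirms that the old path re-exports the canonical objects instead of holding stale copies.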
406
+ Before writing a shim, inventory the full import surface of the module being
407
+ moved: search for `from <package>.<module> import`, `from .<module> import`,
408
+ `from <package> import <module>`, and any indirect imports through package
409
+ `__init__.py`. Confirm every name that any consumer imports is present in the
410
+ shim. After writing the shim, run the full test suite — shim fidelity is the
411
+ central risk, and a missing re-export will surface as `ImportError` in
412
+ consumers.
413
+
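One way to inventory the direct import forms named above is to parse each consumer file with the standard `ast` module. The sketch below is a minimal illustration: `find_import_sites`, `demo_pkg`, and `metrics` are hypothetical names, and it does not follow indirect re-exports through a package `__init__.py`, which still need a separate look.

```python
import ast


def find_import_sites(source, package, module):
    """Return sorted (line, form) pairs for imports that touch package.module."""
    dotted = f"{package}.{module}"
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import <package>.<module>
            if any(alias.name == dotted for alias in node.names):
                hits.append((node.lineno, "import <package>.<module>"))
        elif isinstance(node, ast.ImportFrom):
            imported = {alias.name for alias in node.names}
            if node.level > 0 and node.module == module:
                hits.append((node.lineno, "from .<module> import"))
            elif node.level == 0 and node.module == dotted:
                hits.append((node.lineno, "from <package>.<module> import"))
            elif module in imported and (
                (node.level == 0 and node.module == package)
                or (node.level > 0 and node.module is None)
            ):
                hits.append((node.lineno, "from <package> import <module>"))
    return sorted(hits)


# Hypothetical consumer source, used only to exercise the function.
SAMPLE = (
    "from .metrics import score\n"
    "from demo_pkg.metrics import score as absolute_score\n"
    "from demo_pkg import metrics\n"
    "import os\n"
)

sites = find_import_sites(SAMPLE, "demo_pkg", "metrics")
for line, form in sites:
    print(line, form)
```

A text search remains a useful cross-check, since string-based imports such as `importlib.import_module("demo_pkg.metrics")` are invisible to this kind of AST scan.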
414
+ After the move, verify import direction: the new canonical module should not
415
+ accidentally import through the shim (creating a circular path), and layered
416
+ packages should only import from layers below them.
417
+
418
+ Also verify import **depth**: when code moves from a flat package into a
419
+ subpackage N levels deeper (e.g. flat → `<pkg>/sub1/sub2/`), relative imports
420
+ to sibling top-level packages need N extra dots. From `<pkg>.sub1.sub2`, one
421
+ dot is `<pkg>.sub1.sub2`, two dots is `<pkg>.sub1`, three dots is `<pkg>`.
422
+ Count from the new file's package position up to the common ancestor, then down
423
+ to the target. A smoke test that imports every moved module immediately after
424
+ creation catches depth errors before the test suite runs.
425
+
426
+ Also verify **monkeypatch continuity** when the moved code calls a function that
427
+ tests monkeypatch through the old module's namespace. After the split, the new
428
+ canonical module has its own import binding for that function — independent of
429
+ the shim's binding — so a `monkeypatch.setattr(shim, "name", fake)` does not
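The dot counting described above can be checked without creating any files by using `importlib.util.resolve_name` from the standard library. The package and module names below (`pkg`, `sub1`, `sub2`, `helpers`, `domain`) are placeholders for illustration.

```python
from importlib.util import resolve_name

# __package__ of a module stored at pkg/sub1/sub2/<file>.py (hypothetical names).
package = "pkg.sub1.sub2"

one_dot = resolve_name(".helpers", package)      # stays inside pkg.sub1.sub2
two_dots = resolve_name("..helpers", package)    # climbs to pkg.sub1
three_dots = resolve_name("...domain", package)  # climbs to pkg, then down to domain
print(one_dot, two_dots, three_dots)
```

Running the same resolution for each relative import in a moved file is a cheap way to confirm the dots before the first real import is attempted.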
430
+ affect the new module's call sites. When this pattern exists, make the new
431
+ module look up the function through the shim at call time (e.g. `import
432
+ <old_module> as _shim` and call `_shim.<name>(...)`) rather than via a direct
433
+ bare-name import. This keeps the monkeypatch-able namespace as the single
434
+ point of indirection without changing any logic.
435
+
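The binding independence described above can be reproduced in a few lines. This is a sketch under assumptions: the module names (`demo_helpers`, `demo_old`, `demo_new_direct`, `demo_new_via_shim`) and the `load` function are hypothetical, and the modules are built in memory rather than on disk.

```python
import sys
import types


def make_module(name, source):
    """Create an in-memory module so the demo needs no files on disk."""
    module = types.ModuleType(name)
    sys.modules[name] = module
    exec(source, module.__dict__)
    return module


# The function that tests patch through the old module's namespace.
make_module("demo_helpers", "def load():\n    return 'real'\n")

# Old module path: exposes `load` as a module attribute.
shim = make_module("demo_old", "from demo_helpers import load\n")

# New module with its own bare-name binding for `load`.
direct = make_module(
    "demo_new_direct",
    "from demo_helpers import load\n"
    "def run():\n"
    "    return load()\n",
)

# New module that looks `load` up through the old namespace at call time.
via_shim = make_module(
    "demo_new_via_shim",
    "import demo_old as _shim\n"
    "def run():\n"
    "    return _shim.load()\n",
)

# Roughly what monkeypatch.setattr(shim, "load", fake) does (minus the restore).
shim.load = lambda: "fake"

direct_result = direct.run()      # bare-name binding is untouched by the patch
via_shim_result = via_shim.run()  # call-time lookup sees the patched attribute
print(direct_result, via_shim_result)
```

The difference comes entirely from where the name is looked up at call time; neither module's logic changes.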
436
+ This pattern applies when the monkeypatch target is a **refabr-internal
437
+ function binding** — a function defined inside the repository that the
438
+ canonical module imports via `from <module> import <func>`. A different
439
+ mechanism applies when the monkeypatch target is a **shared stdlib singleton**
440
+ (e.g. `importlib`, `os`, `sys`). Stdlib modules are singletons cached in
441
+ `sys.modules` — every module that does `import importlib.util` receives a
442
+ reference to the same `importlib` object. When a test patches
443
+ `<shim>.importlib.util.find_spec`, it mutates `sys.modules['importlib'].util.find_spec`,
444
+ and the canonical module's own `importlib.util.find_spec(...)` calls see the
445
+ same mutation because both reference the same shared singleton. In this case
446
+ **no module-level-alias fix is needed** — the shim only needs to re-export
447
+ the stdlib module binding (e.g. `from <canonical> import importlib`) so that
448
+ `<shim>.<stdlib_module>` resolves. Before applying either fix, determine
449
+ which mechanism is in play: test `import sys; <canonical>.<module> is
450
+ sys.modules['<module>']`. If true, it's shared-singleton propagation and needs
451
+ only shim re-export. If false, it's a refabr function binding and needs the
452
+ module-level-alias fix.
453
+
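The shared-singleton case can be checked the same way. The sketch below uses two hypothetical in-memory modules, `demo_canonical` and `demo_shim`, and temporarily replaces `importlib.util.find_spec` by hand; a real test would normally rely on pytest's `monkeypatch` fixture for the restore step.

```python
import importlib
import importlib.util
import sys
import types

# Stand-ins for the canonical module and its shim (hypothetical names).
canonical = types.ModuleType("demo_canonical")
canonical.importlib = importlib        # the binding `import importlib.util` creates
shim = types.ModuleType("demo_shim")
shim.importlib = canonical.importlib   # the binding the shim re-exports

# The identity test from the text: True means shared-singleton propagation.
is_shared = canonical.importlib is sys.modules["importlib"]

original = importlib.util.find_spec


def fake_find_spec(name, package=None):
    return None


# Patch through the shim's reference, then observe through the canonical one.
shim.importlib.util.find_spec = fake_find_spec
try:
    seen_by_canonical = canonical.importlib.util.find_spec("json")
finally:
    importlib.util.find_spec = original  # always restore the real function

print(is_shared, seen_by_canonical)
```

Because both names point at the same module object, the patch is visible from either side and no call-time indirection is needed.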
454
+ Also verify **transitive import direction** — not just what the new module
455
+ directly imports, but what those imports drag in. A module in a restricted
456
+ layer (e.g. a summarizer that must not touch runtime) may import only result
457
+ dataclasses from a higher layer — semantically clean. But Python executes every
458
+ module-level `import`/`from` statement of the imported module, so if that
459
+ higher-layer module imports runtime adapters, scheduler registries, or config
460
+ registries at module level, the summarizer transitively pulls them all in. This
461
+ is a transitive import violation even though the summarizer's own `import` lines
462
+ look correct. Audit transitive chains by following each import target module's
463
+ own module-level imports recursively until reaching a layer boundary. When a
464
+ transitive violation exists, the narrow fix is to extract the needed result types
465
+ into a dedicated leaf module that only imports from permitted layers (typically
466
+ domain), then have both the higher-layer module and the summarizer import from
467
+ that leaf. When such extraction is out of scope, record the pre-existing
468
+ violation in memory and defer it — do not silently accept it as clean.
469
+
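The recursive audit described above can be approximated with `ast`. The sketch below is deliberately simplified and rests on stated assumptions: module sources come from a hypothetical in-memory mapping, only absolute imports in top-level statements are followed, and the parent-package `__init__.py` files that Python also executes are not modeled.

```python
import ast


def module_level_imports(source):
    """Absolute module names imported by top-level statements of `source`."""
    found = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            found.add(node.module)
    return found


def transitive_imports(start, sources):
    """Follow module-level imports from `start` through every known module."""
    seen = set()
    stack = [start]
    while stack:
        current = stack.pop()
        # Modules without known source (stdlib, third party) are leaves here.
        for target in module_level_imports(sources.get(current, "")):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


# Hypothetical layered project, used only to exercise the walk.
SOURCES = {
    "app.summary": "from app.engine.results import RunResult\n",
    "app.engine.results": (
        "from app.runtime.adapter import launch\n"
        "from app.domain import types\n"
    ),
    "app.runtime.adapter": "import subprocess\n",
    "app.domain": "",
}

reached = transitive_imports("app.summary", SOURCES)
violations = sorted(name for name in reached if name.startswith("app.runtime"))
print(violations)
```

Here the summarizer's own import line looks clean, yet the walk reports that it reaches the runtime layer through `app.engine.results`.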
470
+ When a phased refactor creates a package directory that shares its name with
471
+ an existing flat module (e.g. creating `metrics/` when `metrics.py` already
472
+ exists in the same directory), Python resolves `import <pkg>.<name>` to the
473
+ package directory, making the flat file unreachable under its original import
474
+ path. Before creating the package, check for a same-name flat module. If one
475
+ exists, absorb its public surface into the package's `__init__.py` and delete
476
+ the flat file from the working tree. The absorbed content must be byte-identical
477
+ to the original — no logic changes. Consumers that imported from the flat
478
+ module (e.g. `from <pkg>.<name> import <symbol>`) will now resolve through the
479
+ package `__init__.py`; verify that every public name the flat module exported
480
+ appears in the `__init__.py`. If the flat module also has consumers that import
481
+ it as a standalone object (`import <pkg>.<name> as <alias>`), those consumers
482
+ resolve to the package module (`<pkg>/<name>/__init__.py`) transparently.
483
+ A `git rm` of the flat file is the cleanest removal — do not leave a `.bak`
484
+ rename, a comment-only stub, or an empty placeholder that would still shadow
485
+ the package in some Python import resolution orders.
486
+
487
+ When extending a package's `__init__.py` to re-export a newly added sibling
488
+ module, check for circular imports before adding the re-export line. If the
489
+ new sibling module imports from another sibling (e.g. `<new>.py` does
490
+ `from ..<old_shim> import <name>`), and the `__init__.py` already re-exports
491
+ from that other sibling (or from the shim that resolves to it), adding
492
+ `from .<new> import ...` to `__init__.py` creates a circular chain:
493
+ `__init__` → `.<new>` → `..<old_shim>` → `__init__`. When this is detected,
494
+ leave the `__init__.py` incomplete — the new sibling remains accessible
495
+ through its root re-export shim, not through the package `__init__.py`.
496
+ Document the deferred `__init__.py` re-export as a known gap for a future
497
+ dedicated decoupling pass. Do not reorder imports, add lazy imports, or
498
+ restructure sibling dependencies to force the `__init__.py` re-export — those
499
+ are behavior-sensitive changes that belong to a later phase.
500
+
501
+ ### Domain-Local Extraction vs. Cross-Module Dedup
502
+
503
+ When a task asks to extract shared helpers, distinguish two operations with
504
+ very different risk profiles:
505
+
506
+ **Domain-local extraction** — moving private helpers from one module to a
507
+ sibling within the same package. The helpers already live in the module that
508
+ owns them; the extraction just gives each concern its own named file. Call
509
+ sites stay in the same file(s); the only change is an import line switching
510
+ from local `def` to `from .sibling import`. Risk is low because the contract
511
+ does not expand to new callers.
512
+
513
+ **Cross-module dedup** — deleting copy-pasted helpers from many modules and
514
+ having them all import from a single canonical source. Risk is high when the
515
+ copies are not byte-identical: subtle differences in whitespace handling,
516
+ signature types, edge-case behavior, or sibling-helper availability mean a
517
+ mechanical unification silently changes validation behavior for some modules.
518
+ Before attempting cross-module dedup, inventory every copy and compare their
519
+ bodies. If copies differ, do not unify them under the current task — defer to
520
+ a dedicated reconciliation pass that can assess each behavioral difference
521
+ individually. The skill's rule "do not relax a shared validator; keep it true
522
+ for every caller" applies across modules too: a validator that changes behavior
523
+ for even one caller is a different validator.
524
+
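A first-pass inventory of copies can be automated by comparing AST fingerprints. In this sketch the helper name `_clean` and the module names are hypothetical. AST equality ignores comments and formatting, so it is looser than a byte-for-byte comparison and only serves to flag copies that clearly differ.

```python
import ast
from collections import defaultdict


def function_fingerprint(source, name):
    """Return a structural dump of the top-level function `name`, or None."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return ast.dump(node)
    return None


def group_copies(copies, name):
    """Group module names whose copy of `name` has the same structure."""
    groups = defaultdict(list)
    for module, source in copies.items():
        groups[function_fingerprint(source, name)].append(module)
    return sorted(sorted(modules) for modules in groups.values())


# Hypothetical copy-pasted helpers; pkg.c has drifted from the other two.
COPIES = {
    "pkg.a": "def _clean(value):\n    return value.strip()\n",
    "pkg.b": "def _clean(value):\n    return value.strip()   # same logic\n",
    "pkg.c": "def _clean(value):\n    return value.strip().lower()\n",
}

groups = group_copies(COPIES, "_clean")
print(groups)
```

More than one group means the copies differ in behavior-relevant structure, which is the signal to defer unification to a dedicated reconciliation pass.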
525
+ When the task explicitly scopes only domain-local extraction and names
526
+ cross-module copies as deferred, do not touch those copies even if they appear
527
+ to share the same name. Record them as deferred work and move on.
528
+
529
+ Audit path-resolution chains when a value passes through multiple resolution
530
+ layers before reaching its final use site. A path written in a config file
531
+ may be resolved relative to the config directory by a config parser, stored
532
+ as a raw string in an internal record, and later resolved again relative to
533
+ the process CWD by a downstream ``resolve_executable``-style function. When
534
+ the two resolution bases differ (config-dir ≠ CWD), the result can point to
535
+ the wrong file or produce a ``FileNotFoundError``. Before committing a
536
+ relative-path fix, trace the full chain: where is the path first resolved,
537
+ where is it stored, where is it used, and what is the CWD at each stage.
538
+
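The two-base problem described above is easy to reproduce with pure path arithmetic. The directories and the `resolve_against` helper below are hypothetical; the sketch only shows how the same raw string lands on different files depending on which base is applied.

```python
from pathlib import PurePosixPath


def resolve_against(base, raw):
    """Resolve `raw` against `base` unless it is already absolute."""
    path = PurePosixPath(raw)
    return path if path.is_absolute() else PurePosixPath(base) / path


config_dir = "/repo/configs"   # base used by a config parser (assumed)
process_cwd = "/repo"          # base used later at the call site (assumed)
raw_value = "tools/run.sh"     # the string stored in the internal record

at_parse_time = resolve_against(config_dir, raw_value)
at_use_time = resolve_against(process_cwd, raw_value)
bases_agree = at_parse_time == at_use_time
print(at_parse_time, at_use_time, bases_agree)
```

Writing down both results for a real config value makes it clear at which stage the path should be made absolute.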
539
+ When code uses both in-process import checks and out-of-process subprocess
540
+ calls, verify that they resolve to the same Python interpreter and package
541
+ set. A preflight that passes in-process (import succeeds with the current
542
+ ``sys.path`` and venv packages) does not prove that a subprocess
543
+ ``["python", "-m", "some.module"]`` will succeed — ``"python"`` may be a
544
+ different interpreter on ``PATH``. Test the exact subprocess command (with
545
+ the same env vars) before concluding the runtime logic is correct. When the
546
+ subprocess command is built by an adapter from configurable fields, prefer
547
+ to make the Python executable configurable and default to the venv Python
548
+ when the task's environment context makes that the correct choice.
549
+
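A quick probe can show whether a subprocess runs under the same interpreter as the in-process preflight. The helper name `describe_interpreter` is an assumption made for this sketch; it only uses `subprocess.run`, `shutil.which`, and `sys` from the standard library.

```python
import json
import shutil
import subprocess
import sys


def describe_interpreter(executable):
    """Ask `executable` which interpreter and prefix it actually runs with."""
    probe = "import json, sys; print(json.dumps([sys.executable, sys.prefix]))"
    done = subprocess.run(
        [executable, "-c", probe], capture_output=True, text=True, check=True
    )
    path, prefix = json.loads(done.stdout)
    return {"executable": path, "prefix": prefix}


# Pinning the subprocess to sys.executable keeps it in the current environment.
pinned = describe_interpreter(sys.executable)

# A bare "python" is resolved through PATH and may be missing or different.
bare = shutil.which("python")

print("in-process prefix:       ", sys.prefix)
print("pinned subprocess prefix:", pinned["prefix"])
print("bare 'python' on PATH:   ", bare)
```

If the bare `python` path differs from `sys.executable`, a command built as `["python", ...]` may run without the packages the preflight saw, which is the case a configurable Python executable is meant to cover.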
167
550
  When an interface forces every caller to pass excessive parameters, consider a
168
551
  small explicit context or config object. Do not turn that into a framework when
169
552
  plain values remain clearer.
@@ -193,6 +576,12 @@ Before writing code, identify the natural owner of the change:
193
576
  and tests;
194
577
  - a loader or manifest change should mainly touch the input layer and tests.
195
578
 
579
+ When public exports, docs, registries, harness runners, generated artifacts, or
580
+ paper-output surfaces are explicitly excluded, treat existing mentions of the
581
+ new helper in those surfaces as defects to remove, not as consistency surfaces
582
+ to update. Tests should import scoped helpers from their owning module when the
583
+ package entrypoint is out of scope.
584
+
196
585
  If one feature requires unrelated edits across many areas, treat that as a
197
586
  framework-boundary risk. Do the smallest local refactor that brings related code
198
587
  together, or report the coupling if a safe local refactor is outside scope.
@@ -201,6 +590,16 @@ Keep code that changes together close. Keep unrelated reasons to change in
201
590
  separate modules. Public/shared layers should contain only stable capabilities
202
591
  needed by multiple users; special cases should stay near their use sites.
203
592
 
593
+ When a fix patches a latent gap at one call site and sibling call sites share
594
+ the same pattern, they share one change reason and belong in the same change.
595
+ Audit siblings before closing the fix: if a second entry point builds and runs
596
+ the same kind of command, the same resolution, validation, or environment wiring
597
+ applies there too. Fix in-scope siblings together so behavior does not diverge
598
+ silently across entry points. When a sibling is out of scope, record the
599
+ divergence explicitly — which site, which gap, why deferred — instead of leaving
600
+ the inconsistency implicit. Change locality is not a reason to fix one site and
601
+ leave an identical latent gap unflagged at a co-located site.
602
+
204
603
  ## Harness And Test Discipline
205
604
 
206
605
  Harnesses serve paper goals, performance comparison, method screening, module
@@ -274,6 +673,20 @@ Keep harness and test responsibilities separate:
274
673
  - export-surface assertions belong in export tests, invalid-state assertions in
275
674
  invalid-state tests, and identity/schema assertions in clearly named identity
276
675
  or schema tests;
676
+ - for a new private helper with two or more mutually exclusive branches (a
677
+ relative-versus-bare path resolver, a presence-versus-absence dispatcher, an
678
+ enabled-versus-disabled switch), pin each branch with its own focused
679
+ assertion unless the project's test design explicitly forbids that test
680
+ category. A branch taken only when a condition holds is untested when every
681
+ fixture takes the other branch; a deterministic test of a pure helper is not
682
+ the same as a forbidden real-execution or real-data test and should not be
683
+ skipped on that basis. A branch-pinning assertion only counts as coverage if
684
+ the repository's normal validation command actually executes it: a doctest
685
+ embedded in a source module, a test guarded by a non-default flag, or a test
686
+ file outside the configured test path is not exercised by the requested
687
+ command and gives no evidence under it. Place the assertion where the suite
688
+ runs it, or explicitly label it as an unexecuted extra guard; do not count an
689
+ unrun doctest toward the green-suite total;
277
690
  - harness code should not become functional test code;
278
691
  - test code should not become paper-performance evaluation.
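The branch-pinning rule in the list above can be sketched as follows. `resolve_config_path` is a hypothetical private helper with a relative-versus-bare split, and the `configs` subdirectory is an assumption made for the example. Each branch gets its own deterministic assertion, so neither branch depends on which fixtures happen to exist.

```python
from pathlib import PurePosixPath


def resolve_config_path(name: str, base_dir: str) -> str:
    """Hypothetical private helper with two mutually exclusive branches."""
    if "/" in name:
        # Relative-path branch: keep the caller's directory structure.
        return str(PurePosixPath(base_dir) / name)
    # Bare-name branch: look in the assumed "configs" subdirectory.
    return str(PurePosixPath(base_dir) / "configs" / name)


def test_relative_path_branch() -> None:
    assert resolve_config_path("sub/run.toml", "/repo") == "/repo/sub/run.toml"


def test_bare_name_branch() -> None:
    assert resolve_config_path("run.toml", "/repo") == "/repo/configs/run.toml"
```

These tests only count as coverage when they live in a file that the repository's normal validation command collects.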
279
692
 
@@ -302,12 +715,26 @@ surfaces as current, stale, or historical. Edit only stale current surfaces
302
715
  needed for the accepted change. If all requested docs are current, report a
303
716
  no-op docs sync and the readback/search checks that proved it.
304
717
 
718
+ When a bounded harness or smoke API is added and an in-scope harness README,
719
+ contract note, or config/example doc already describes a broader runner,
720
+ registry, paper experiment, or framework-level entrypoint, include that surface
721
+ in the docs map. Either narrow the current entrypoint/output wording to the
722
+ accepted bounded API, or mark the broader text as planned or historical if that
723
+ is true in the repository. Do not leave full-runner or paper-output semantics as
724
+ the current contract for a module-level helper.
725
+
305
726
  When one accepted symbol, artifact, metric, method, or helper is documented in
306
727
  multiple parallel surfaces, build a small surface map before editing: helper or
307
728
  API lists, emitted names, package/module summaries, layout rows, test summaries,
308
729
  and absence clauses. Update each stale parallel surface consistently, but do
309
730
  not add new public exports, runtime behavior, or future-plan claims just because
310
731
  the docs mention the accepted bounded surface.
732
+ For module-level adapter surfaces, map both source/module rosters and test
733
+ rosters. If a scoped test file exists, every current README-style test roster
734
+ that lists comparable sibling tests should include it at the same hierarchy and
735
+ translation level. If a scoped module has no direct test file and the task does
736
+ not request one, do not invent an out-of-scope test solely for roster symmetry;
737
+ report the indirect or absent coverage honestly and keep edits inside scope.
311
738
  Bind the surface map to the current selected subject. Neighboring helpers,
312
739
  methods, tests, metrics, or earlier accepted features in the same document are
313
740
  context, not part of the sync, unless the user explicitly scopes them or the
@@ -423,6 +850,16 @@ it, the current workflow instruction names that handoff, or an existing active
423
850
  trajectory already contains that selected task. Otherwise leave a neutral
424
851
  waiting state such as "no next developer task is selected."
425
852
 
853
+ Listing **candidate next phases** (e.g. "Next slice options: <phase_a> or
854
+ <phase_b>") is not the same as selecting a next task. Candidates document
855
+ the real gaps visible after the current phase completes — they inform a future
856
+ selection decision but do not make one. A memory file that ends with a
857
+ candidate list and no explicit selection is in a neutral waiting state, not a
858
+ selected handoff. When updating memory after completing a phase, advance the
859
+ stale pointer past the completed phase and list the remaining real gaps as
860
+ candidates; do not promote any candidate to "selected next task" unless the
861
+ user, workflow, or active trajectory explicitly selects it.
862
+
426
863
  When the accepted source/test task explicitly excluded docs, TODO, exports,
427
864
  harnesses, experiments, or generated outputs, preserve that exclusion in the
428
865
  trajectory. A later TODO-only pass may record accepted work and verified stale
@@ -546,15 +983,164 @@ problem appears.
546
983
  Use the user's requested validation command when provided. Before running, check
547
984
  that every explicitly requested target exists; a missing target is a blocker to
548
985
  report, not permission to silently narrow the command or create the target.
986
+ Run the command literally from the repository root unless the user gave another
987
+ working directory. Do not substitute a broader suite, omit arguments, add
988
+ environment variables, or rely on an unreported install/import workaround as
989
+ evidence for the requested command. If setup is required before the exact
990
+ command can run, state that setup separately and then rerun and report the exact
991
+ command result.
992
+ When reporting validation, keep each requested command's result separate unless
993
+ you have explicitly deduplicated overlapping tests. Do not invent a combined
994
+ total from overlapping suites, and check that any subtotals you report add up.
995
+ If a command was run twice because suites overlap, say that clearly instead of
996
+ presenting the repeated tests as additional coverage.
997
+ The headline status must match the worst required validation result. Do not say
998
+ "all validations pass", "all clean", "accepted as-is", or "complete" when any
999
+ required command failed, collected errors, or was skipped. If a failure appears
1000
+ pre-existing or out of scope, label the command as failed with a pre-existing or
1001
+ scope caveat; do not convert it into a passing validation summary.
1002
+ If you run extra tests beyond the user's requested validation, label them as
1003
+ extra guards and keep them out of the required-command total. Do not describe an
1004
+ unrequested test file or suite as part of the scoped validation surface unless
1005
+ the task or review explicitly added it.
1006
+
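One way to keep the headline honest is to derive it from the per-command results instead of writing it by hand. The sketch below assumes a hypothetical `CommandResult` record and a three-level status scale; extra guards are marked `required=False` so they stay out of the required-command total.

```python
from dataclasses import dataclass
from typing import Sequence

# Ordered from best to worst; the headline takes the worst status present.
STATUS_ORDER = ("passed", "skipped", "failed")


@dataclass(frozen=True)
class CommandResult:
    command: str
    status: str  # one of STATUS_ORDER
    required: bool = True
    note: str = ""  # for example "pre-existing" or "out of scope"


def headline_status(results: Sequence[CommandResult]) -> str:
    """Return the worst status among required commands only."""
    required = [r for r in results if r.required]
    if not required:
        return "no required validation was run"
    worst = max(required, key=lambda r: STATUS_ORDER.index(r.status))
    return worst.status
```

A failed command with a pre-existing caveat still yields a `failed` headline; the caveat belongs in `note`, not in the status.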
1007
+ For a pure structural move (no logic change), verify byte-level identity between
1008
+ the new file and the original content — via diff, checksum, or file-copy
1009
+ confirmation. When the task allows exactly one import-path line to differ, diff
1010
+ the two files and confirm only that line changed. Verify import direction after
1011
+ the move: the new canonical module should not import through the old shim path,
1012
+ and layered packages should only import from lower layers. An AST-level import
1013
+ scan or a focused grep for `from <wrong_direction> import` catches wrong-way
1014
+ dependencies that tests may not exercise.
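A minimal sketch of both checks, using only the standard library. The function names are hypothetical. `differing_lines` compares the original and moved text line by line, and `imported_modules` lists absolute imports found by an AST walk so they can be compared against the layering rule.

```python
import ast
from typing import List, Tuple


def differing_lines(original: str, moved: str) -> List[Tuple[int, str, str]]:
    """Return (line_number, old, new) for each line that differs.

    Assumes both texts have the same number of lines, which holds for a pure
    move where at most an import path is rewritten in place.
    """
    old_lines = original.splitlines()
    new_lines = moved.splitlines()
    if len(old_lines) != len(new_lines):
        raise ValueError("line counts differ; this is not a pure structural move")
    return [
        (number, old, new)
        for number, (old, new) in enumerate(zip(old_lines, new_lines), start=1)
        if old != new
    ]


def imported_modules(source: str) -> List[str]:
    """List absolute module names imported anywhere in the source via AST."""
    names: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.append(node.module)
    return names
```

Because `splitlines` ignores line-ending differences, this comparison is weaker than byte-level identity; a checksum or binary diff is still needed when the task allows no differences at all.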
549
1015
 
550
1016
  For source or test changes, prefer the smallest relevant test target that proves
551
1017
  the accepted contract, unless the user asked for a broader suite. Use command
552
1018
  forms that avoid repository cache or bytecode artifacts when the project allows.
1019
+ When tests are intentionally removed because they covered an excluded surface,
1020
+ the expected test count should decrease. Treat an unchanged or unexpectedly
1021
+ higher count as a signal to re-check for stale tests. A full-suite pass is useful
1022
+ context but does not replace the exact requested validation result.
1023
+ If the exact validation command names files or directories, those paths are
1024
+ preserved targets unless the user explicitly says to remove them. A missing
1025
+ target is not a successful cleanup; restore the accepted target or report the
1026
+ conflict instead of narrowing the command, deleting dependent modules, or
1027
+ substituting a broader suite.
1028
+ If required validation fails because an out-of-scope dependency, record,
1029
+ adapter, metric, parser, or export contract is missing or incompatible, do not
1030
+ repair that external contract inside the current task. Report a
1031
+ validation-scope conflict with the import or call chain, the smallest candidate
1032
+ scope expansion, and any in-scope validation that still passes. Passing the
1033
+ requested command after unapproved out-of-scope edits is still not a successful
1034
+ completion.
1035
+
1036
+ When a verification or refactor task discovers a **pre-existing architectural
1037
+ violation** — an import direction violation, a transitive runtime dependency, a
1038
+ god-module leak, a structural inconsistency — that predates the current task
1039
+ scope, do not fix it and do not let it block acceptance. Classify it: is it
1040
+ pre-existing (flat-module-era code that was never layered) or task-induced (the
1041
+ current move/split created a new wrong-way dependency)? Pre-existing violations
1042
+ are recording targets: document them in memory as deferred reconciliation work
1043
+ with the specific import chain, the layer rule violated, and the suggested
1044
+ narrow fix. Task-induced violations are blocking defects that must be fixed
1045
+ before acceptance. The test for pre-existing: if reverting the current task's
1046
+ changes would leave the violation intact, it is pre-existing. If the violation
1047
+ appears only because of the current task's module move or import change, it is
1048
+ task-induced.
1049
+ For stabilization tasks, inspect whether the tests still prove the named
1050
+ contract. A green run is weak evidence when assertions were loosened, regression
1051
+ cases were removed, or only happy-path behavior remains. Restore or add focused
1052
+ coverage for every user-named behavior before reporting that the surface was
1053
+ already stable.
1054
+ When a stabilization task says to add or verify focused tests, treat "verify"
1055
+ literally. First map the existing test names, fixtures, and assertions to each
1056
+ user-named behavior. If the behavior is already covered and the exact requested
1057
+ validation passes, do not add duplicate tests just to create activity. Add or
1058
+ restore tests only for missing, weakened, or ambiguous coverage, then report the
1059
+ coverage map and exact command results.
1060
+ When a stabilization pass makes no edits, the coverage map is the main
1061
+ deliverable. Do not stop at "no changes needed" plus a validation table; name
1062
+ the existing source, test, and docs assertions that prove each user-named
1063
+ contract item, and call out any item that is intentionally only indirectly
1064
+ covered.
1065
+ When the scope asks for non-mutation "where applicable", treat it as a concrete
1066
+ coverage item. For each helper or adapter, identify caller-owned inputs that
1067
+ could be mutated, such as mappings, sequences, records, dataclasses, configs, and
1068
+ fixture objects. Add a focused non-mutation assertion for every mutable or
1069
+ caller-owned input, including optional scoped keys or fields, or explicitly
1070
+ report why mutation is not applicable.
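A focused non-mutation assertion can be as small as the sketch below. `normalize_record` is a hypothetical helper standing in for the helper or adapter under test; the pattern is to deep-copy the caller-owned input before the call and compare it afterwards.

```python
import copy
from typing import Any, Dict, Mapping


def normalize_record(record: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper under test: returns a new, normalized record."""
    normalized = dict(record)
    normalized["name"] = str(record["name"]).strip().lower()
    # sorted() builds a new list, so the caller's nested list is not touched.
    normalized["tags"] = sorted(record.get("tags", []))
    return normalized


def test_normalize_record_does_not_mutate_input() -> None:
    record = {"name": "  Alpha ", "tags": ["b", "a"]}
    snapshot = copy.deepcopy(record)

    result = normalize_record(record)

    assert result == {"name": "alpha", "tags": ["a", "b"]}
    # The caller-owned mapping and its nested list are unchanged.
    assert record == snapshot
```

The deep copy matters: a shallow snapshot would share the nested list with the input and could not detect an in-place sort.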
553
1071
 
554
1072
  After validation, check for generated cache/build/test artifacts created by the
555
1073
  run and remove only those generated artifacts. Do not clean unrelated dirty or
556
1074
  untracked user work.
557
1075
 
1076
+ Before reporting success after source, test, config, or docs edits, run the
1077
+ repository's normal formatting or whitespace check when one exists. If no
1078
+ project-specific check is known, run the version-control whitespace check, such
1079
+ as `git diff --check`, from the repository root. Treat trailing whitespace,
1080
+ conflict markers, and blank-line-at-EOF warnings as blockers even when all
1081
+ tests pass. This check is especially important after deleting duplicate tests,
1082
+ merging adjacent blocks, or editing the end of a file.
1083
+ Report whitespace checks exactly: "clean" means no output and no findings. If the
1084
+ check reports a pre-existing out-of-scope finding, say that it is a separate
1085
+ scope conflict or pre-existing blocker; do not call the command clean and do
1086
+ not fix the out-of-scope file unless the active task permits it.
1087
+ In validation tables, keep the status label and note consistent: use "passed
1088
+ with pre-existing warnings" or "blocked by pre-existing findings" as
1089
+ appropriate, not "clean", whenever the command emits any warning or finding.
1090
+
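The labeling rule can be sketched as a small classifier over the check's exit status and output. The function names and the two labels are assumptions made for this example; `git diff --check` itself exits with a non-zero status when it finds problems.

```python
import subprocess
from typing import Tuple


def classify_whitespace_check(returncode: int, output: str) -> str:
    """Map a whitespace-check result to a report label.

    "clean" is reserved for a zero exit status with no output at all.
    """
    if returncode == 0 and not output.strip():
        return "clean"
    return "findings reported"


def run_git_diff_check(repo_root: str) -> Tuple[int, str]:
    """Run `git diff --check` from the repository root and capture its output."""
    completed = subprocess.run(
        ["git", "diff", "--check"],
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=False,
    )
    return completed.returncode, completed.stdout + completed.stderr
```

Whether a finding is in scope or pre-existing is a separate judgment; the classifier only guarantees that "clean" is never reported when the command printed anything.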
1091
+ When the task has an explicit allowed-file set or excluded surface list, run a
1092
+ changed-file check before reporting success. The only newly modified,
1093
+ untracked, moved, deleted, or created repository paths should be the allowed
1094
+ paths plus disposable validation artifacts that were removed. Also search for
1095
+ the accepted symbol or capability in excluded surfaces such as package
1096
+ entrypoints, export tests, docs, registries, harnesses, memory/TODO files, and
1097
+ artifact writers when those surfaces were named as exclusions. Passing tests do
1098
+ not override an out-of-scope changed file or stale excluded-surface reference.
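A changed-file check can be sketched as a comparison between the paths in `git status --porcelain` output and the allowed-file set. The helper names are hypothetical, and the parser is a simplification: it handles the version 1 short format and rename arrows but not quoted paths with unusual characters.

```python
from typing import Iterable, List, Set


def changed_paths(porcelain_output: str) -> List[str]:
    """Extract paths from `git status --porcelain` (v1) output.

    Simplification: quoted paths with unusual characters are not handled.
    """
    paths: List[str] = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue
        entry = line[3:]  # skip the two status columns and the separator
        # Renames are reported as "old -> new"; both sides count as touched.
        paths.extend(entry.split(" -> "))
    return paths


def out_of_scope_paths(porcelain_output: str, allowed: Iterable[str]) -> List[str]:
    """Return every touched path that is not in the allowed-file set."""
    allowed_set: Set[str] = set(allowed)
    return sorted(p for p in changed_paths(porcelain_output) if p not in allowed_set)
```

Running the status command with `--untracked-files=all` lists untracked files individually instead of collapsing them into a directory entry, which makes the comparison stricter.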
1099
+ If you report that a file or capability was removed, verify the file path no
1100
+ longer exists, no untracked placeholder remains, and no package metadata or
1101
+ module entrypoint still references it.
1102
+ If you report that a file was rewritten, replaced, or reduced to a narrower
1103
+ surface, re-open the final file from disk and verify the exact callable
1104
+ signature, key result fields, line-level absence of prohibited imports or
1105
+ helpers, and expected line-count direction before reporting success. Do not rely
1106
+ on an editor buffer, generated patch text, prior session output, or developer
1107
+ report as evidence that the rewrite persisted.
1108
+ For importable modules, also verify that the code loaded by the validation
1109
+ environment is the same final source file when stale behavior has appeared
1110
+ before. Use the language's normal inspection tools when cheap, such as checking
1111
+ the loaded file path and public signature. If the test runner imports an older
1112
+ surface than the file you think you wrote, stop and resolve that mismatch before
1113
+ claiming validation success.
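For Python, the loaded-file and signature check might look like the sketch below. The function names are hypothetical; the inspection calls (`importlib.import_module`, the module's `__file__` attribute, `inspect.signature`) are standard library features.

```python
import importlib
import inspect
import os


def loaded_source_matches(module_name: str, expected_path: str) -> bool:
    """True when the imported module was loaded from the expected file."""
    module = importlib.import_module(module_name)
    loaded_path = getattr(module, "__file__", None) or ""
    return os.path.realpath(loaded_path) == os.path.realpath(expected_path)


def public_signature(module_name: str, callable_name: str) -> str:
    """Return the signature of a callable as the running environment sees it."""
    module = importlib.import_module(module_name)
    return str(inspect.signature(getattr(module, callable_name)))
```

A mismatch usually points at an installed copy of the package, a stale build directory, or a different working directory shadowing the edited source.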
1114
+ When the user gives an absence or stale-reference search, run that command
1115
+ literally. If no command is given, search every plausible surface for the banned
1116
+ capability: tracked and untracked source, tests, docs, package metadata, module
1117
+ entrypoints, examples, configs, and harness descriptors. Do not limit the search
1118
+ to the files you edited or to the modules you expected to change.
1119
+ For docs-only cleanup with an exact absence command and scoped file list, treat
1120
+ that command as the primary acceptance gate. Run it before editing to build the
1121
+ hit list, and rerun it after every docs pass until it returns zero hits across
1122
+ all scoped files. Do not mark unedited files as "unaffected", "already clean",
1123
+ or "persisted" from memory; the final absence output and targeted readback must
1124
+ prove every scoped file is clean. If the absence command still has hits, report
1125
+ the remaining stale surfaces instead of using passing tests as a completion
1126
+ claim.
1127
+ Exact absence searches are literal, not semantic. A banned phrase still fails
1128
+ when it appears in a negative boundary statement, future-plan note, heading, or
1129
+ historical sentence. Rephrase scoped docs so the exact tokens disappear while
1130
+ preserving the boundary meaning, or report a scope conflict if the stale phrase
1131
+ must remain in an out-of-scope historical surface.
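The literal nature of the search can be shown with a short sketch. `literal_hits` is a hypothetical helper that scans in-memory file text for an exact phrase; it reports a hit even when the phrase appears inside a negative statement.

```python
from typing import Dict, List, Tuple


def literal_hits(banned: str, files: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Return (path, line_number, line) for every literal occurrence of `banned`."""
    hits: List[Tuple[str, int, str]] = []
    for path in sorted(files):
        for number, line in enumerate(files[path].splitlines(), start=1):
            if banned in line:
                hits.append((path, number, line))
    return hits
```

In the usage below, a sentence that denies the capability still counts as a hit, so the wording has to change for the search to come back empty:

```python
docs = {
    "README.md": "The helper does not provide a full runner.\n",
    "docs/api.md": "Bounded smoke API only.\n",
}
hits = literal_hits("full runner", docs)
# hits contains the README.md line even though it is a negative statement.
```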
1132
+ If an exact absence command matches accepted baseline surfaces or unrelated
1133
+ scoped documentation, keep the original command as the failing task evidence and
1134
+ report the scope conflict. A narrowed or corrected regex can be useful as a
1135
+ diagnostic, but it is not a substitute for the user's required validation
1136
+ command unless the user or reviewer revises the task. Do not remove accepted
1137
+ baseline documentation, exports, tests, or API names only to satisfy an
1138
+ overbroad absence pattern.
1139
+ Rerun absence searches after the final edit, not before the last deletion. For
1140
+ file deletion, verify each requested path individually with a filesystem check;
1141
+ glob output, command success text, or a prior deletion attempt is not proof that
1142
+ the file is absent.
1143
+
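The per-path deletion check can be sketched with the standard library alone. `verify_absent` is a hypothetical helper; it checks every requested path individually instead of relying on glob output or on the success text of an earlier command.

```python
import os
from typing import Dict, Iterable


def verify_absent(paths: Iterable[str]) -> Dict[str, bool]:
    """Map each requested path to True when nothing exists there on disk."""
    # os.path.lexists also reports dangling symlinks, which exists() would miss.
    return {path: not os.path.lexists(path) for path in paths}
```

Any `False` value in the result means the reported deletion did not persist and the path still needs attention.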
558
1144
  For docs-only or TODO-only work, do not run tests unless executable code or
559
1145
  test files changed accidentally. Re-read edited docs/TODO files and run targeted
560
1146
  text searches for the accepted names, stale predecessor names, and broad absence
@@ -650,11 +1236,52 @@ For bounded helpers, verify that the implementation:
650
1236
  - avoids adjacent runtime surfaces such as loaders, registries, exporters,
651
1237
  harnesses, CLI, experiments, or paper outputs unless explicitly in scope.
652
1238
 
1239
+ When reviewing a scoped change with an allowed-file list, compare the actual
1240
+ changed-file list against that list before reviewing behavior. Flag any package
1241
+ export, export-surface test, documentation, registry, harness, artifact writer,
1242
+ TODO/memory note, stash/backup directory, generated file, or adjacent module as
1243
+ blocking when the request excluded it. Do not accept in-repository stash folders
1244
+ as cleanup; out-of-scope work must be removed from the task's worktree state or
1245
+ explicitly separated outside the repository by user-approved workflow.
1246
+ For cleanup reviews, compare deletions against the preservation map and the
1247
+ requested validation command. Deleting a validation target, accepted feature, or
1248
+ unrelated subsystem is blocking even when the stale excluded search becomes
1249
+ clean. Import errors after removing an excluded file usually mean stale public
1250
+ wiring remains; they do not justify deleting the importer or collapsing the
1251
+ larger surface.
1252
+ When the developer claims an excluded capability was removed, verify with the
1253
+ filesystem, full untracked status, and targeted search. Comment-only stubs,
1254
+ empty files, disabled tests, parser branches, package metadata entries, module
1255
+ entrypoints, and docs that still mention the capability are still present
1256
+ surfaces. Treat mismatches between the report and the worktree as blocking
1257
+ until the worktree is the source of truth.
1258
+
1259
+ When a parameter, option, command, or public surface is removed, search its
1260
+ literal name across implementation, tests, docs, configs, and examples, then
1261
+ read the caller chain around remaining hits. A stale signature, forwarded
1262
+ keyword, config key, fixture assertion, or README command is blocking unless it
1263
+ is explicitly outside the task and reported as a scope conflict.
1264
+ For command-like surfaces, search both structural names and user-visible
1265
+ actions: module names, compatibility aliases, entrypoint functions, parser or
1266
+ handler helpers, package metadata keys, command strings, and tests that import
1267
+ or exercise those handlers. A wrapper that only re-exports the removed handler
1268
+ is still the removed surface.
1269
+
1270
+ When reviewing a stabilization report that says no changes were needed, still
1271
+ compare implementation and tests against the accepted contract. Flag drift in
1272
+ call signatures, default arguments, derived keys, empty-input handling,
1273
+ non-mutation or provenance behavior, and removed regression tests even if the
1274
+ requested commands pass.
1275
+
653
1276
  For documentation reviews, compare every newly edited absence clause against
654
1277
  the implemented-surface list, package/module summaries, layout rows, and test
655
1278
  summaries. Treat broad "no <category>" wording as a defect when a narrower
656
1279
  bounded surface in that category is already accepted; ask for the smallest
657
1280
  wording fix instead of reopening source or tests.
1281
+ If a file is meant to be a static config or example, docs should not describe it
1282
+ as a runner, entrypoint, command surface, managed artifact generator, or paper
1283
+ output path. Replace execution-surface wording with the narrow API or config
1284
+ semantics that the task actually accepts.
658
1285
  Also treat leftover duplicated words, duplicated sentence tails, or malformed
659
1286
  negative clauses as docs defects when they change or obscure the intended
660
1287
  scope.
@@ -789,10 +1416,24 @@ After edits, audit:
789
1416
  - no avoidable global state, hidden paths, repeated registration points, or
790
1417
  heavy config burden were added;
791
1418
  - the change stayed local to the natural owner;
1419
+ - stabilization preserved accepted callable signatures, defaults, key
1420
+ derivation, boundary behavior, provenance, non-mutation, and regression
1421
+ coverage;
1422
+ - cleanup preserved validation targets and accepted work instead of deleting
1423
+ dependent modules, tests, or subsystems to avoid stale references;
792
1424
  - harness and test responsibilities remain separate;
793
1425
  - artifact schemas, exporters, docs, and tests agree when any changed;
794
1426
  - framework docs were updated or confirmed current when in scope;
795
1427
  - external reused code has compatible license and attribution;
1428
+ - explicit allowed-file and excluded-surface scope was preserved, including no
1429
+ in-repository stash, backup, memory, TODO, export, docs, harness, registry, or
1430
+ artifact files created as cleanup side effects;
1431
+ - edited files pass the repository whitespace or diff check; test
1432
+ deduplication, file-end edits, and copied docs blocks left no trailing
1433
+ whitespace, conflict markers, or extra blank lines at EOF;
1434
+ - excluded capabilities are absent from source files, tests, docs, package
1435
+ metadata, parser or handler branches, module entrypoints, and untracked files,
1436
+ with no explanatory stubs or placeholders left behind;
796
1437
  - no generated cache/build/test/output/result artifacts were left behind unless
797
1438
  explicitly requested.
798
1439
 
@@ -810,5 +1451,26 @@ Keep the final response concise:
810
1451
  - validation performed, using readback/search checks for docs-only work;
811
1452
  - caveats that affect the user's next action.
812
1453
 
1454
+ A round's real delta is not only source/test edits: a verification or smoke run
1455
+ that wrote on-disk artifacts, a status or memory-record flip, and a docs/config
1456
+ sync all count. Do not open a report with "no changes needed", "no work
1457
+ required", or "already complete and verified on disk" when any such delta
1458
+ exists; "no source/behavior changes" is the precise, sanctioned phrasing when
1459
+ only non-source surfaces moved, and those edits must still be listed as this
1460
+ round's work. Do not sustain a no-op framing by attributing this round's delta
1461
+ to a prior round (for example labeling this round's status flip as "updated
1462
+ previous turn"), and do not list state already on disk before this round as
1463
+ this round's work — mis-attribution in either direction is the same honesty
1464
+ defect. Report only the real delta, verified against the prior on-disk state,
1465
+ not against memory or a task narrative; a verification run plus a status/memory
1466
+ flip is a deliverable, not "no work".
1467
+
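One way to ground the report in the prior on-disk state is to compare content hashes taken before and after the round. The sketch below uses hypothetical helper names and takes file bytes as input, so reading the files from disk is left to the caller.

```python
import hashlib
from typing import Dict, List


def snapshot(files: Dict[str, bytes]) -> Dict[str, str]:
    """Hash file contents keyed by path; callers read the bytes from disk."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def round_delta(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two snapshots and list what this round actually changed."""
    return {
        "created": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "modified": sorted(
            path for path in set(before) & set(after) if before[path] != after[path]
        ),
    }
```

A round whose delta contains only a memory-record flip and a smoke artifact has no source changes, but the delta is not empty, so "no changes needed" would be the wrong opening line.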
1468
+ A pure verification pass — where the task asks to confirm that prior work is
1469
+ intact and the developer inspects state, runs tests, and finds no new edits are
1470
+ needed — is the exception. In that case the report should describe the checks
1471
+ performed and their results, and may state "no new changes made" or "prior work
1472
+ confirmed intact on disk." Do not use this exception to relabel an
1473
+ implementation pass where work was silently skipped.
1474
+
813
1475
  Do not explain skill internals, tool mechanics, or style theory unless the user
814
1476
  asked for a skill optimizer report.