npm - workflow-supervisor - Versions diffs - 0.1.4 → 0.2.0 - Mend

workflow-supervisor 0.1.4 → 0.2.0

Files changed (18)

package/CHANGELOG.md +30 -0
package/README.md +69 -5
package/bin/workflow-skills.mjs +201 -1
package/docs/artifacts.md +4 -0
package/docs/cli.md +3 -1
package/docs/portable-delegation.md +14 -1
package/docs/skill-reference.md +11 -1
package/docs/troubleshooting.md +24 -0
package/package.json +1 -1
package/schemas/dossier-v1.schema.json +38 -0
package/schemas/worker-report-v1.schema.json +120 -12
package/skills/acceptance-matrix/SKILL.md +114 -2
package/skills/acceptance-matrix/agents/openai.yaml +1 -1
package/skills/dossier-builder/SKILL.md +28 -0
package/skills/work-unit/SKILL.md +46 -6
package/skills/workflow-docs/references/workflow-control.md +58 -6
package/skills/workflow-supervisor/SKILL.md +51 -3
package/skills/workflow-supervisor/agents/openai.yaml +1 -1

package/CHANGELOG.md CHANGED

@@ -2,6 +2,36 @@
 This changelog was reconstructed from npm publish metadata and git history after the first four package versions were published without GitHub releases or tags.
+## 0.2.0 - 2026-06-23
+Prepared outcome-evaluation verification for npm publication.
+### Added
+- Added capability-aware outcome evaluation to `WorkerReportV1` through `verification_environment` and `outcome_evaluations`.
+- Added row-level outcome verdicts so verifiers can record `CONDITIONAL_PASS` for behavior that is strongly inferred but not fully observable.
+- Added verification capability metadata for checks such as browser snapshots, jsdom renders, API probes, state-machine tests, file snapshots, and static diff inspection.
+- Added acceptance-matrix, dossier-builder, workflow-docs, and README guidance for expected outcomes, evidence strength, invalid PASS conditions, and capability limitations.
+### Changed
+- Treat implementer `PASS` as a claim that must be mapped to source requirements, acceptance rows, outcome evidence, verifier verdicts, and supervisor audit.
+- Treat tests, typecheck, lint, and build as evidence types instead of automatic material-outcome proof.
+- Require material outcome rows to be directly observed as `PASS`, blocked, or explicitly waived before final green status.
+### Fixed
+- Reject top-level `CONDITIONAL_PASS` as an invalid `WorkerReportV1.status`.
+- Reject top-level `PASS` reports when any outcome row is failed, blocked, or only conditionally observed.
+- Reject `PASS` outcome rows without row-mapped evidence and reject unknown verification capabilities.
+- Prevent unavailable browser, visual, live-service, credential, network, or human-review proof from being hidden inside final PASS reports.
+### Verified
+- Expanded delegate CLI tests for conditional outcome rows, missing row evidence, and unknown capabilities.
+- Expanded lifecycle tests to assert outcome verification rules across supervisor, acceptance matrix, dossier, workflow docs, README, troubleshooting, and schema artifacts.
+- Validated the package with `npm run validate` before release prep.
 ## 0.1.4 - 2026-06-19
 Prepared for npm publication.

package/README.md CHANGED

@@ -10,10 +10,26 @@ Example prompt:
 Use $workflow-supervisor to build a FastAPI Naive RAG demo for a healthcare specialist agent.
 ```
-The correct first response is not code. The correct first response is an intake packet. That is intentional.
+When you explicitly ask for Workflow Supervisor, the correct first response is not code. The correct first response is an intake packet. That is intentional.
 ![Workflow Supervisor hero image showing the supervisor coordinating source corpus, work units, dossiers, roles, loop policy, acceptance, repair, workflow docs, and final outputs](assets/workflow-supervisor-hero.png)
+## Route First
+Use the smallest route that fits the work before choosing a profile.
+| Situation | Route |
+|---|---|
+| Small, clear edit with obvious files and acceptance | Do not use Workflow Supervisor. Execute directly. |
+| Large bounded backlog with clear unit done signals | `lean_work_unit_runner`. |
+| Broad, ambiguous, source-of-truth, delegated, security-sensitive, dirty-state, release, resume, or externally published work | `strict_full_workflow`. |
+| Sequencing, risk review, or backlog shaping only | `planning_only`. |
+| Runnable uncertainty before implementation | Create a discovery or prototype unit first. |
+This route check matters most when Workflow Supervisor was not explicitly invoked. If the task is a tiny direct edit, normal repo work is better than starting a supervisor loop.
+If the user explicitly invokes `workflow-supervisor`, `$workflow-supervisor`, or says to use the skill, do not silently skip it. Select the proportional profile, then keep the overhead as small as that profile allows.
 ## What You Get
 Workflow Supervisor gives you a repeatable workflow for serious agent tasks:
@@ -90,7 +106,7 @@ flowchart TB
 ## What Happens When You Invoke It
-When you explicitly invoke `workflow-supervisor`, `$workflow-supervisor`, or say to use the skill, the first decision is the execution profile:
+When you explicitly invoke `workflow-supervisor`, `$workflow-supervisor`, or say to use the skill, the first supervisor decision is the execution profile:
 - `lean_work_unit_runner`: for large, already-bounded work-unit backlogs where throughput and low memory matter.
 - `strict_full_workflow`: for ambiguous, high-risk, delegated, security-sensitive, source-of-truth, publication, or cross-system work.
@@ -229,8 +245,8 @@ Common workflow files:
 | `.workflow/WORK-UNITS.md` | `work-unit`, `workflow-docs` | Unit list, dependencies, sequencing, blocked units. |
 | `.workflow/DOSSIER.md` or `.workflow/dossiers/*.yaml` | `dossier-builder`, `workflow-docs` | Worker contracts for implementation, verification, repair, or documentation. |
 | `.workflow/WORKER-MAP.md` | `workflow-supervisor`, `worker-roles`, `workflow-docs` | Worker names, roles, transports, native resource ids, lifecycle, reports, close results, blockers. |
-| `.workflow/ACCEPTANCE-MATRIX.md` | `acceptance-matrix`, `workflow-docs` | Evidence rows and material PASS, FAIL, BLOCKED states. |
-| `.workflow/VERIFICATION-REPORT.md` | verifier worker, `acceptance-matrix`, `workflow-docs` | Verification evidence, findings, skipped checks, residual risks. |
+| `.workflow/ACCEPTANCE-MATRIX.md` | `acceptance-matrix`, `workflow-docs` | Evidence rows, outcome evaluation rows, capability limits, and material PASS, FAIL, BLOCKED states. |
+| `.workflow/VERIFICATION-REPORT.md` | verifier worker, `acceptance-matrix`, `workflow-docs` | Verification environment, outcome evidence, findings, skipped checks, residual risks. |
 | `.workflow/REPAIR-TICKETS.md` | repair worker, `workflow-docs` | Repair tasks tied to failed rows or verifier findings. |
 | `.workflow/DECISIONS.md` | supervisor, `workflow-docs` | User decisions, assumptions, reversals, unresolved questions. |
 | `.workflow/HANDOFF.md` | supervisor, `workflow-docs` | Resume pack for another agent or later session. |
@@ -334,6 +350,33 @@ Every delegated worker returns this machine-shaped report:
   "findings": [],
   "blocking_question": null,
   "next_action": "supervisor_review",
+  "verification_environment": {
+    "shell": true,
+    "filesystem": true,
+    "git_diff": true,
+    "browser": false,
+    "playwright_mcp": false,
+    "network": false,
+    "capabilities": ["shell_command", "api_probe", "static_diff_inspection"],
+    "limitations": ["Browser layout was not verified because browser capability was unavailable"]
+  },
+  "outcome_evaluations": [
+    {
+      "id": "A-001",
+      "source_requirement": "User can create a workflow and read it back.",
+      "expected_outcome": "The API persists the workflow and returns it with the expected schema.",
+      "preferred_verification": ["integration_test", "api_probe", "static_diff_inspection"],
+      "available_verification": ["integration_test", "api_probe", "static_diff_inspection"],
+      "evidence_strength": {
+        "strongest_possible": ["integration_test"],
+        "strongest_available": ["integration_test"]
+      },
+      "evidence": ["pytest tests/test_api.py passed"],
+      "invalid_pass_conditions": ["tests only without API behavior assertion", "hardcoded fixture"],
+      "verdict": "PASS",
+      "finding": ""
+    }
+  ],
   "adapter": null,
   "guard": null,
   "reason": null
@@ -342,6 +385,25 @@ Every delegated worker returns this machine-shaped report:
 The supervisor trusts the report shape, not loose prose. A PASS without evidence is invalid. A verifier that edits implementation is invalid. A worker that asks the human directly is converted into a blocker for the supervisor to route.
+Outcome verification treats the implementer report as a claim. A material behavior row needs expected outcome, evidence strength, preferred and available verification capabilities, invalid PASS conditions, and row-mapped evidence. `CONDITIONAL_PASS` is allowed only as a row-level outcome verdict for behavior that is strongly inferred but not fully observable; it is not a top-level `WorkerReportV1.status` and must not be treated as final green status without explicit waiver evidence.
+### Outcome Evaluation
+The implementer report is not the proof of the outcome. It is a structured claim that gives the supervisor something concrete to verify.
+For outcome-bearing work, the verifier maps each material source requirement to:
+- the expected user or system-visible outcome
+- the strongest preferred verification capability
+- the strongest verification capability actually available
+- row-mapped evidence
+- invalid PASS conditions
+- a row verdict
+Valid row verdicts are `PASS`, `FAIL`, `BLOCKED`, and `CONDITIONAL_PASS`. A row can be `CONDITIONAL_PASS` only when the available evidence strongly supports the outcome but a stronger material capability is unavailable. The missing capability and required external check must be recorded.
+The final worker status remains narrower: `PASS`, `FAIL`, or `BLOCKED`. A final `PASS` requires every material outcome row to be `PASS` unless the user explicitly waives or narrows the missing proof.
 For native threads or subagents, the report is only the work result. The supervisor must also close the native resource. For Codex subagents, record the returned `agent_id` and call `close_agent` after the report, timeout, failure, blocker, cancellation, or invalid-output result is captured. Final outcome is blocked while any native worker lacks a close result.
 ## How The Supervisor Talks To Workers
@@ -578,6 +640,8 @@ If you are an agent using this package:
 7. Treat same-session verification as `focused-check` or `self-check`, not `independent-verifier`.
 8. Trust only structured `WorkerReportV1` results from delegated workers.
 9. Treat verifier edits as invalid.
-10. Keep `.workflow/` ignored and local unless the user explicitly asks to publish it.
+10. Treat tests/typecheck/build as evidence types, not automatic outcome proof.
+11. Record capability limits instead of pretending browser, visual, live-service, credential, network, or human-review proof exists.
+12. Keep `.workflow/` ignored and local unless the user explicitly asks to publish it.
 The promise is not magic autonomy. The promise is disciplined supervision: clear setup, bounded work, scoped workers, structured reports, evidence, repair, and a clean final handoff.
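
The outcome rules in the README hunks above (row verdicts may be `CONDITIONAL_PASS`, the top-level status may not, and a final `PASS` needs every material row to be `PASS`) can be sketched as a small status derivation. This is a minimal sketch, not package code: `topLevelStatus`, the row `A-002`, its evidence strings, and the choice to let `FAIL` outrank `BLOCKED` are all assumptions made for illustration.

```javascript
// Hypothetical helper: not part of workflow-supervisor. It derives a top-level
// WorkerReportV1.status from row verdicts under the rules quoted above.
function topLevelStatus(rows) {
  const verdicts = rows.map((row) => row.verdict);
  if (verdicts.every((verdict) => verdict === "PASS")) return "PASS";
  // Assumption: a failed row outranks blocked or conditional rows.
  if (verdicts.includes("FAIL")) return "FAIL";
  // CONDITIONAL_PASS is never promoted to a top-level status.
  return "BLOCKED";
}

// Illustrative rows; A-002 and its contents are invented for this example.
const rows = [
  { id: "A-001", verdict: "PASS", evidence: ["pytest tests/test_api.py passed"] },
  {
    id: "A-002",
    verdict: "CONDITIONAL_PASS",
    evidence: ["jsdom render matched expected markup"],
    capability_limitations: ["browser_snapshot unavailable"],
    required_external_check: ["Confirm layout in a real browser"],
  },
];

console.log(topLevelStatus(rows)); // BLOCKED
```

The sketch ignores user waivers; the README allows a waiver or narrowed scope to change the outcome, and that decision stays with the supervisor.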

package/bin/workflow-skills.mjs CHANGED

@@ -19,6 +19,26 @@ const AGENTS = new Set([...INSTALLABLE_AGENTS, "generic"]);
 const DELEGATE_AGENTS = new Set(["codex", "claude-code"]);
 const WORKER_ROLES = new Set(["implementer", "verifier", "repair", "documenter"]);
 const REPORT_STATUSES = new Set(["PASS", "FAIL", "BLOCKED"]);
+const OUTCOME_VERDICTS = new Set(["PASS", "FAIL", "BLOCKED", "CONDITIONAL_PASS"]);
+const VERIFICATION_CAPABILITIES = new Set([
+  "static_diff_inspection",
+  "diff_inspection",
+  "shell_command",
+  "unit_test",
+  "integration_test",
+  "contract_test",
+  "data_contract_test",
+  "jsdom_render",
+  "api_probe",
+  "file_snapshot",
+  "generated_html_snapshot",
+  "component_tree_snapshot",
+  "accessibility_tree_snapshot",
+  "state_machine_test",
+  "browser_snapshot",
+  "human_required",
+  "manual_review",
+]);
 const WORKFLOW_STATE_IGNORE_ENTRY = ".workflow/";
 function usage() {
@@ -483,6 +503,7 @@ function parseSimpleYaml(text) {
     }
     const items = [];
+    const object = {};
     for (i += 1; i < lines.length; i += 1) {
       const next = lines[i];
       if (!next.trim() || next.trimStart().startsWith("#")) continue;
@@ -492,8 +513,10 @@ function parseSimpleYaml(text) {
       }
       const item = next.match(/^\s*-\s*(.*)$/);
       if (item) items.push(unquoteScalar(item[1]));
+      const property = next.match(/^\s+([A-Za-z_][A-Za-z0-9_-]*):(?:\s*(.*))?$/);
+      if (property) object[property[1]] = parseDossierScalar(property[2] || "");
     }
-    result[key] = items.length > 0 ? items : "";
+    result[key] = items.length > 0 ? items : Object.keys(object).length > 0 ? object : "";
   }
   return result;
 }
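
The `parseSimpleYaml` hunk above adds a second pattern so that indented `key: value` lines under a block key are collected into an object instead of being dropped. The sketch below exercises only that regular expression, copied from the hunk. `parseProperty` is a hypothetical wrapper, and it returns the raw string where the package applies `parseDossierScalar`, whose behavior is not shown in this diff.

```javascript
// Regex copied from the hunk above; parseProperty is a hypothetical wrapper.
const PROPERTY_LINE = /^\s+([A-Za-z_][A-Za-z0-9_-]*):(?:\s*(.*))?$/;

function parseProperty(line) {
  const match = line.match(PROPERTY_LINE);
  // The package passes match[2] through parseDossierScalar; here it stays a raw string.
  return match ? { key: match[1], value: match[2] || "" } : null;
}

console.log(parseProperty("  red_capable: yes"));  // { key: 'red_capable', value: 'yes' }
console.log(parseProperty("  expected_runtime:")); // { key: 'expected_runtime', value: '' }
console.log(parseProperty("  - list item"));       // null (list items use the other pattern)
console.log(parseProperty("top_level: value"));    // null (no leading indentation)
```

Because the pattern requires leading whitespace, only indented lines are treated as nested properties.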
@@ -556,6 +579,14 @@ const DOSSIER_CORE_ARRAY_FIELDS = [
 ];
 const DOSSIER_EXPLICIT_ARRAY_FIELDS = ["assumptions", "open_questions"];
+const FEEDBACK_LOOP_FIELDS = [
+  "command_or_evidence",
+  "red_capable",
+  "exact_symptom_or_behavior",
+  "deterministic",
+  "expected_runtime",
+  "agent_runnable",
+];
 function isPlaceholder(value, { allowNone = false } = {}) {
   const normalized = String(value || "").trim().toLowerCase().replace(/[.!]+$/, "");
@@ -584,6 +615,61 @@ function validateConcreteArray(data, field, errors, options = {}) {
   return values;
 }
+function dossierSearchText(data) {
+  return [
+    data.workflow,
+    data.work_unit,
+    data.title,
+    data.objective,
+    ...fieldArray(data.work_points),
+    ...fieldArray(data.acceptance_matrix),
+    ...fieldArray(data.adversarial_checks),
+    ...fieldArray(data.required_commands_or_evidence),
+    ...fieldArray(data.stop_gates),
+  ]
+    .join(" ")
+    .toLowerCase();
+}
+function dossierNeedsFeedbackLoop(data) {
+  return /\b(bug|fix|regression|defect|broken|crash|error|failure|failing|risky behavior|behavior change|behaviour change|change behavior|change behaviour)\b/.test(
+    dossierSearchText(data),
+  );
+}
+function validateFeedbackLoop(data, warnings) {
+  const loop = data.feedback_loop;
+  const needsLoop = dossierNeedsFeedbackLoop(data);
+  if (!loop) {
+    if (needsLoop) {
+      warnings.push("feedback_loop is recommended for bug-fix or risky behavior-change dossiers");
+    }
+    return;
+  }
+  if (typeof loop !== "object" || Array.isArray(loop)) {
+    warnings.push("feedback_loop should be an object with command_or_evidence, red_capable, exact_symptom_or_behavior, deterministic, expected_runtime, and agent_runnable");
+    return;
+  }
+  for (const field of FEEDBACK_LOOP_FIELDS) {
+    if (isPlaceholder(loop[field])) warnings.push(`feedback_loop.${field} should be concrete`);
+  }
+  if (loop.red_capable && !["yes", "no", "not_applicable"].includes(String(loop.red_capable))) {
+    warnings.push("feedback_loop.red_capable should be yes, no, or not_applicable");
+  }
+  if (loop.deterministic && !["yes", "no"].includes(String(loop.deterministic))) {
+    warnings.push("feedback_loop.deterministic should be yes or no");
+  }
+  if (loop.agent_runnable && !["yes", "no"].includes(String(loop.agent_runnable))) {
+    warnings.push("feedback_loop.agent_runnable should be yes or no");
+  }
+  if (needsLoop && String(loop.red_capable) !== "yes") {
+    warnings.push("bug-fix or risky behavior-change dossiers should name a red-capable feedback loop or explicit waiver");
+  }
+}
 function validateDossierData(data, { role, unitId } = {}) {
   const errors = [];
   const warnings = [];
@@ -649,6 +735,8 @@ function validateDossierData(data, { role, unitId } = {}) {
     if (!/\b[A-Z]+[0-9]+\b/.test(row)) warnings.push(`acceptance_matrix[${index}] should include a stable row ID`);
   });
+  validateFeedbackLoop(data, warnings);
   const unresolved = fieldArray(data.open_questions).filter((item) => !/^(none|no open questions|empty)$/i.test(item));
   if (unresolved.length > 0) {
     errors.push("open_questions must be explicitly none before delegation; create a discovery dossier or stop as BLOCKED");
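
The trigger in `dossierNeedsFeedbackLoop` is a whole-word keyword match, which is worth seeing on sample text. The sketch below copies the pattern from the hunk above. `needsFeedbackLoop` is a hypothetical stand-in that searches only `title` and `objective`, whereas the package's `dossierSearchText` joins more fields, and the sample titles are invented.

```javascript
// Pattern copied from dossierNeedsFeedbackLoop in the hunk above.
const FEEDBACK_LOOP_TRIGGER =
  /\b(bug|fix|regression|defect|broken|crash|error|failure|failing|risky behavior|behavior change|behaviour change|change behavior|change behaviour)\b/;

// Hypothetical stand-in for dossierSearchText: joins two fields and lowercases them.
function needsFeedbackLoop({ title = "", objective = "" }) {
  return FEEDBACK_LOOP_TRIGGER.test(`${title} ${objective}`.toLowerCase());
}

console.log(needsFeedbackLoop({ title: "Fix login crash" }));       // true
console.log(needsFeedbackLoop({ title: "Add onboarding docs" }));   // false
console.log(needsFeedbackLoop({ title: "Prefix route names" }));    // false ("fix" inside "prefix")
console.log(needsFeedbackLoop({ title: "Resolve reported bugs" })); // false (plural "bugs")
```

Because of the `\b` boundaries, inflected forms such as "bugs" or "fixes" do not trigger the recommendation on their own. Since `validateFeedbackLoop` only pushes warnings, a missed keyword weakens the hint but does not change whether the dossier is accepted.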
@@ -877,6 +965,8 @@ function buildWorkerPrompt({ role, unitId, dossierText }) {
         findings: [],
         blocking_question: null,
         next_action: "",
+        verification_environment: null,
+        outcome_evaluations: [],
         adapter: null,
         guard: null,
         reason: null,
@@ -1044,6 +1134,109 @@ function ensureArray(value) {
   return Array.isArray(value) ? value : [value];
 }
+function isPlainObject(value) {
+  return Boolean(value && typeof value === "object" && !Array.isArray(value));
+}
+function validateCapabilityList(value, field, errors) {
+  if (!Array.isArray(value)) {
+    errors.push(`${field} must be an array`);
+    return;
+  }
+  for (const capability of value) {
+    if (!VERIFICATION_CAPABILITIES.has(capability)) {
+      errors.push(`${field} contains unsupported capability: ${capability}`);
+    }
+  }
+}
+function validateStringArray(value, field, errors) {
+  if (!Array.isArray(value)) {
+    errors.push(`${field} must be an array`);
+    return;
+  }
+  if (value.some((item) => typeof item !== "string")) {
+    errors.push(`${field} must contain only strings`);
+  }
+}
+function validateVerificationEnvironment(environment, errors) {
+  if (environment == null) return;
+  if (!isPlainObject(environment)) {
+    errors.push("verification_environment must be an object or null");
+    return;
+  }
+  for (const field of ["shell", "filesystem", "git_diff", "browser", "playwright_mcp", "network"]) {
+    if (environment[field] != null && typeof environment[field] !== "boolean") {
+      errors.push(`verification_environment.${field} must be boolean`);
+    }
+  }
+  if (environment.capabilities != null) {
+    validateCapabilityList(environment.capabilities, "verification_environment.capabilities", errors);
+  }
+  if (environment.limitations != null) {
+    validateStringArray(environment.limitations, "verification_environment.limitations", errors);
+  }
+}
+function validateOutcomeEvaluations(report, errors) {
+  const rows = report?.outcome_evaluations;
+  if (!Array.isArray(rows)) {
+    errors.push("outcome_evaluations must be an array");
+    return;
+  }
+  if (report.status === "PASS" && rows.some((row) => row?.verdict !== "PASS")) {
+    errors.push("top-level PASS requires every outcome_evaluations row verdict to be PASS");
+  }
+  rows.forEach((row, index) => {
+    const prefix = `outcome_evaluations[${index}]`;
+    if (!isPlainObject(row)) {
+      errors.push(`${prefix} must be an object`);
+      return;
+    }
+    for (const field of ["id", "source_requirement", "expected_outcome", "verdict"]) {
+      if (typeof row[field] !== "string" || row[field].trim() === "") {
+        errors.push(`${prefix}.${field} must be a non-empty string`);
+      }
+    }
+    if (!OUTCOME_VERDICTS.has(row.verdict)) {
+      errors.push(`${prefix}.verdict must be PASS, FAIL, BLOCKED, or CONDITIONAL_PASS`);
+    }
+    validateCapabilityList(row.preferred_verification, `${prefix}.preferred_verification`, errors);
+    validateCapabilityList(row.available_verification, `${prefix}.available_verification`, errors);
+    if (!isPlainObject(row.evidence_strength)) {
+      errors.push(`${prefix}.evidence_strength must be an object`);
+    } else {
+      validateCapabilityList(row.evidence_strength.strongest_possible, `${prefix}.evidence_strength.strongest_possible`, errors);
+      validateCapabilityList(row.evidence_strength.strongest_available, `${prefix}.evidence_strength.strongest_available`, errors);
+      if (row.evidence_strength.limitation != null && typeof row.evidence_strength.limitation !== "string") {
+        errors.push(`${prefix}.evidence_strength.limitation must be a string`);
+      }
+    }
+    if (!Array.isArray(row.evidence)) errors.push(`${prefix}.evidence must be an array`);
+    validateStringArray(row.invalid_pass_conditions, `${prefix}.invalid_pass_conditions`, errors);
+    if (row.verdict === "PASS" && Array.isArray(row.evidence) && row.evidence.length === 0) {
+      errors.push(`${prefix}.PASS requires row evidence`);
+    }
+    if (row.verdict === "CONDITIONAL_PASS") {
+      const hasLimitation = typeof row.limitation === "string" && row.limitation.trim() !== "";
+      const hasCapabilityLimitation = Array.isArray(row.capability_limitations) && row.capability_limitations.length > 0;
+      if (!hasLimitation && !hasCapabilityLimitation) {
+        errors.push(`${prefix}.CONDITIONAL_PASS requires limitation or capability_limitations`);
+      }
+    }
+    if (row.capability_limitations != null) {
+      validateStringArray(row.capability_limitations, `${prefix}.capability_limitations`, errors);
+    }
+    if (row.required_external_check != null) {
+      validateStringArray(row.required_external_check, `${prefix}.required_external_check`, errors);
+    }
+    if (row.finding != null && typeof row.finding !== "string") {
+      errors.push(`${prefix}.finding must be a string`);
+    }
+  });
+}
 function reportAdapterMeta(adapter, result = {}) {
   return {
     agent: adapter?.agent || null,
@@ -1069,6 +1262,8 @@ function blockedReport({ role, unitId, reason, summary, adapter, guard, stdout,
     findings: reason ? [{ id: reason, severity: "blocking", summary }] : [],
     blocking_question: null,
     next_action: "supervisor_review",
+    verification_environment: null,
+    outcome_evaluations: [],
     adapter: adapter || null,
     guard: guard || { allowed_surface_violations: [], role_violations: [], warnings: [] },
     reason,
@@ -1092,8 +1287,11 @@ function normalizeReport(report, { role, unitId, adapter, guard }) {
     findings: ensureArray(report.findings),
     blocking_question: report.blocking_question ?? null,
     next_action: report.next_action || "",
+    verification_environment: report.verification_environment ?? null,
+    outcome_evaluations: ensureArray(report.outcome_evaluations),
     adapter,
     guard,
+    reason: report.reason ?? null,
   };
 }
@@ -1112,6 +1310,8 @@ function validateWorkerReport(report, { role, unitId }) {
     errors.push("blocking_question requires BLOCKED status");
   }
   if (role === "verifier" && report?.changed_surfaces?.length > 0) errors.push("verifier must not report changed surfaces");
+  validateVerificationEnvironment(report?.verification_environment, errors);
+  validateOutcomeEvaluations(report, errors);
   return errors;
 }
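
To make the row-level rules in `validateOutcomeEvaluations` concrete, here is a condensed sketch of three of them: unknown capabilities are rejected, a `PASS` row needs evidence, and a `CONDITIONAL_PASS` row needs a recorded limitation. It is an illustration under assumptions, not the package function: `checkRow` is hypothetical, the capability set is truncated, a missing `evidence` array is treated as empty (the package reports it as a type error), and `curl_probe` is an invented unknown capability.

```javascript
// Condensed, hypothetical restatement of three row rules from
// validateOutcomeEvaluations; the capability set is truncated for brevity.
const KNOWN_CAPABILITIES = new Set(["integration_test", "api_probe", "jsdom_render", "browser_snapshot"]);

function checkRow(row, index) {
  const prefix = `outcome_evaluations[${index}]`;
  const errors = [];
  for (const capability of row.available_verification || []) {
    if (!KNOWN_CAPABILITIES.has(capability)) {
      errors.push(`${prefix}.available_verification contains unsupported capability: ${capability}`);
    }
  }
  // Simplification: a missing evidence array is treated like an empty one.
  if (row.verdict === "PASS" && (row.evidence || []).length === 0) {
    errors.push(`${prefix}.PASS requires row evidence`);
  }
  const hasLimitation = typeof row.limitation === "string" && row.limitation.trim() !== "";
  const hasCapabilityLimitation =
    Array.isArray(row.capability_limitations) && row.capability_limitations.length > 0;
  if (row.verdict === "CONDITIONAL_PASS" && !hasLimitation && !hasCapabilityLimitation) {
    errors.push(`${prefix}.CONDITIONAL_PASS requires limitation or capability_limitations`);
  }
  return errors;
}

// Three invented rows, each violating exactly one rule.
const rows = [
  { verdict: "PASS", evidence: [], available_verification: ["api_probe"] },
  { verdict: "CONDITIONAL_PASS", evidence: ["jsdom render ok"], available_verification: ["jsdom_render"] },
  { verdict: "PASS", evidence: ["probe returned 200"], available_verification: ["curl_probe"] },
];

const errors = rows.flatMap(checkRow);
console.log(errors.length); // 3
```

Collecting every error instead of stopping at the first mirrors the `errors.push` style of the diff, so a single run can report all problems in a worker report.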

package/docs/artifacts.md CHANGED

@@ -44,4 +44,8 @@ Markdown is the default, but state may also be an inline brief, spreadsheet tab,
 For `lean_work_unit_runner`, prefer one compact ledger over multiple workflow documents. Each executable row should carry `id`, `source_ref`, `scope`, `done`, `check`, `status`, touched surfaces, and blockers. Escalated units may link to strict-mode SPEC, dossier, or verification artifacts only when needed.
+For product or integration implementation, `WORK-UNITS.md` and lean ledger rows should also carry `slice_type`, `observable_behavior`, `expected_outcome`, `demo_or_verification`, `layers_touched`, and `horizontal_slice_justification` where useful. Prefer `tracer_bullet` units for behavior work. Use horizontal slices only for prefactoring, migration safety, infrastructure, documentation, research, or risk-boundary work with a concrete justification.
+For outcome-bearing verification, `ACCEPTANCE-MATRIX.md` and `VERIFICATION-REPORT.md` should include a verification environment, outcome evaluation rows, preferred and available verification capabilities, evidence strength, invalid PASS conditions, and any required external checks. Row-level `CONDITIONAL_PASS` means strongly inferred but not fully observable; it must not be treated as final green status without explicit waiver evidence.
 For native thread or subagent delegation, `WORKER-MAP.md` must record the native resource id, terminal report, close action, and close result. Do not mark a native worker closed until the resource close is recorded.

package/docs/cli.md CHANGED

@@ -122,10 +122,12 @@ Options:
 ### `delegate`
-Run one role-scoped worker through an installed Codex or Claude Code CLI and print exactly one normalized `WorkerReportV1` JSON object. Missing or invalid `DossierV1` contracts, missing CLIs, invalid worker output, timeouts, non-zero PASS results, PASS without evidence, forbidden-surface changes, and verifier mutations become `BLOCKED` reports instead of unstructured prose.
+Run one role-scoped worker through an installed Codex or Claude Code CLI and print exactly one normalized `WorkerReportV1` JSON object. Missing or invalid `DossierV1` contracts, missing CLIs, invalid worker output, timeouts, non-zero PASS results, PASS without evidence, top-level `CONDITIONAL_PASS`, PASS with conditional outcome rows, forbidden-surface changes, and verifier mutations become `BLOCKED` reports instead of unstructured prose.
 The report schema lives at `schemas/worker-report-v1.schema.json`. The Codex adapter passes it via `--output-schema`; the Claude Code adapter passes it via `--json-schema`; every adapter is still wrapper-validated after the run.
+`WorkerReportV1.status` remains `PASS`, `FAIL`, or `BLOCKED`. Outcome-bearing verifier reports may include `verification_environment` and `outcome_evaluations`; `CONDITIONAL_PASS` is allowed only as an outcome row verdict to record strongly inferred but not fully observable behavior.
 `--dossier` is a hard preflight gate. It must parse as `DossierV1` and pass concrete-field checks before the worker process starts. The delegate command uses `allowed_surfaces` and `forbidden_surfaces` from the dossier as surface guards unless explicit CLI surface flags are provided.
 ```bash

package/docs/portable-delegation.md CHANGED

@@ -80,6 +80,17 @@ Every adapter must normalize into this shape:
   "findings": [],
   "blocking_question": null,
   "next_action": "",
+  "verification_environment": {
+    "shell": true,
+    "filesystem": true,
+    "git_diff": true,
+    "browser": false,
+    "playwright_mcp": false,
+    "network": false,
+    "capabilities": ["shell_command", "api_probe", "static_diff_inspection"],
+    "limitations": []
+  },
+  "outcome_evaluations": [],
   "adapter": {
     "agent": "codex",
     "command": "codex exec",
@@ -93,7 +104,7 @@ Every adapter must normalize into this shape:
 }
 ```
-`PASS`, `FAIL`, and `BLOCKED` mean the same thing on both platforms. A worker report without evidence for material acceptance rows is invalid. Invalid output is converted into a deterministic normalized `BLOCKED` report by default. The package does not make a second live worker call to repair formatting, because a second call can mutate state, consume budget, or produce another non-portable transcript.
+`PASS`, `FAIL`, and `BLOCKED` mean the same thing on both platforms. `CONDITIONAL_PASS` is valid only as a row-level `outcome_evaluations[].verdict`, not as top-level `WorkerReportV1.status`. A worker report without evidence for material acceptance rows is invalid. A top-level PASS with failed, blocked, or conditional outcome rows is invalid. Invalid output is converted into a deterministic normalized `BLOCKED` report by default. The package does not make a second live worker call to repair formatting, because a second call can mutate state, consume budget, or produce another non-portable transcript.
 The schema is a package artifact at `schemas/worker-report-v1.schema.json`. Codex receives it through `--output-schema`; Claude Code receives it through `--json-schema`; both adapters are still wrapper-validated after the run.
@@ -163,6 +174,8 @@ For git workspaces, the surface guard compares pre/post git status. Mutable role
 | Worker hangs | Timeout returns normalized `BLOCKED` with adapter timing evidence. |
 | Worker exits non-zero but printed useful text | Do not trust it as PASS. Normalize as `BLOCKED` unless a valid report and clean guards prove otherwise. |
 | Worker returns PASS without evidence | Invalid report. Return normalized `BLOCKED` with `reason: report_validation_failed`. |
+| Worker returns top-level `CONDITIONAL_PASS` | Invalid report. Use `BLOCKED` or `FAIL` top-level status and record `CONDITIONAL_PASS` only on the affected outcome row. |
+| Worker hides conditional outcome proof inside PASS | Invalid report. Top-level PASS requires every material outcome row verdict to be PASS. |
 | Tests cannot run | Verifier returns `BLOCKED` or `PASS` only with substitute evidence accepted by the acceptance matrix. |
 | Repair expands scope | Reject unless the repair dossier explicitly allowed the new surfaces and criteria. |
 | Units touch same surfaces | Run sequentially. Parallel delegation requires proven disjoint mutable surfaces. |
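
The failure table above says an invalid report is turned into a normalized `BLOCKED` report with `reason: report_validation_failed` and no second worker call. A minimal sketch of that conversion follows. `toBlockedReport` is a hypothetical name, the validation message is invented, and the field values mirror the `blockedReport` lines visible in the `workflow-skills.mjs` hunks of this diff, not a confirmed implementation.

```javascript
// Hypothetical wrapper step; field names follow the WorkerReportV1 examples in this diff.
function toBlockedReport(report, validationErrors) {
  if (validationErrors.length === 0) return report;
  const summary = validationErrors.join("; ");
  return {
    ...report,
    status: "BLOCKED",
    findings: [{ id: "report_validation_failed", severity: "blocking", summary }],
    next_action: "supervisor_review",
    // blockedReport in the diff resets these two fields.
    verification_environment: null,
    outcome_evaluations: [],
    reason: "report_validation_failed",
  };
}

// Invented worker output using the rejected top-level status.
const workerOutput = { status: "CONDITIONAL_PASS", findings: [], outcome_evaluations: [] };
const normalized = toBlockedReport(workerOutput, ["status must be PASS, FAIL, or BLOCKED"]);

console.log(normalized.status, normalized.reason); // BLOCKED report_validation_failed
```

The conversion is deterministic: the same invalid input always yields the same `BLOCKED` report, which fits the table's stated reason for not re-prompting the worker.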

package/docs/skill-reference.md CHANGED

@@ -4,6 +4,16 @@
 Coordinate explicit supervised or agent-loop workflows with profile-based overhead. It starts by selecting `lean_work_unit_runner`, `strict_full_workflow`, or `planning_only`, then completes the intake needed for that profile before implementation, goal binding, worker delegation, or final disposition. The user must answer required intake items; the supervisor must not infer path, mode, delegation, final disposition, or boundaries from vague keywords. Lean mode is for large already-bounded work-unit backlogs: it keeps a compact ledger with unit id, source reference, scope, done signal, check, status, touched surfaces, and blockers, then executes one ready unit at a time with targeted checks and escalation gates. Strict mode creates a source-requirement coverage ledger and SPEC review gate before work units so controlling-source deliverables, roadmap phases, and exit criteria are either implemented, explicitly deferred, blocked, or marked non-material. In human-in-loop mode, the human can ask questions, request revisions, block, defer, or approve before execution. In autonomous goal mode, human clarification pauses resume from recorded workflow state after the answer updates only affected downstream artifacts. Strict mode can orchestrate named workers from dossiers through the portable delegate command or an approved native adapter. Native threads and subagents require a recorded native resource id plus a close result, such as `close_agent` for Codex subagents, before a worker is `closed`. Loading the skill itself does not spawn workers. It binds Codex goals only after complete intake and when the user or environment authorizes goal-oriented work, checks active goal state first, avoids unrelated active-goal collisions, and treats terminal blocked goals as history when resuming through workflow docs.
+Route first before profile selection. If Workflow Supervisor was not explicitly invoked and the task is a small, clear edit with obvious files and acceptance, do not use Workflow Supervisor; execute directly. If the user explicitly invokes `workflow-supervisor`, `$workflow-supervisor`, or says to use the skill, select the proportional profile instead of silently skipping the supervisor.
+| Situation | Route |
+|---|---|
+| Small, clear edit with obvious files and acceptance | Do not use Workflow Supervisor. Execute directly. |
+| Large bounded backlog with clear unit done signals | `lean_work_unit_runner`. |
+| Broad, ambiguous, source-of-truth, delegated, security-sensitive, dirty-state, release, resume, or externally published work | `strict_full_workflow`. |
+| Sequencing, risk review, or backlog shaping only | `planning_only`. |
+| Runnable uncertainty before implementation | Create a discovery or prototype unit first. |
 ## `source-corpus`
 Rank and reconcile sources when authority, freshness, contradictions, access rights, or evidence gaps change the safe next action. It supports `allowed_next_action`: discovery, provisional draft, production change, or blocked.
@@ -22,7 +32,7 @@ Define role contracts and solo-mode phase separation. It prevents role bleed: ve
 ## `acceptance-matrix`
-Create formal evidence-mapped acceptance rows for high-risk, supervised, ambiguous, resumable, or delegated workflows. Rows must preserve source requirement strength, including named systems, quantities, live integration language, and exit criteria; weaker proxy checks require explicit user waiver or scope narrowing.
+Create formal evidence-mapped acceptance rows for high-risk, supervised, ambiguous, resumable, or delegated workflows. Rows must preserve source requirement strength, including named systems, quantities, live integration language, and exit criteria; weaker proxy checks require explicit user waiver or scope narrowing. Outcome-bearing rows also name expected outcomes, preferred and available verification capabilities, evidence strength, invalid PASS conditions, and capability limitations. `CONDITIONAL_PASS` is row-level only and must not be treated as final green status without explicit waiver evidence.
 ## `loop-policy`
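
The routing table added to `skill-reference.md` in the hunk above can be read as a small decision function. The following is a minimal sketch only, not code from the package: the function name `routeTask`, every flag on the `task` object, the precedence order between rows, and the `execute_directly`, `discovery_or_prototype_unit_first`, and `needs_intake` labels are all assumptions made for illustration. Only the three profile names come from the diff.

```javascript
// Hypothetical flag names; the package describes these situations in prose only.
const STRICT_TRIGGERS = [
  "broad", "ambiguous", "sourceOfTruth", "delegated", "securitySensitive",
  "dirtyState", "release", "resume", "externallyPublished",
];

// Assumed precedence: strict triggers win, then discovery, planning, lean, direct.
function routeTask(task) {
  if (STRICT_TRIGGERS.some((flag) => task[flag] === true)) {
    return "strict_full_workflow";
  }
  if (task.runnableUncertainty) return "discovery_or_prototype_unit_first";
  if (task.planningOnly) return "planning_only";
  if (task.largeBoundedBacklog) return "lean_work_unit_runner";
  if (task.smallClearEdit) {
    // An explicit invocation must not be silently skipped; the docs say the
    // lightest valid profile is "usually" the lean runner for bounded work.
    return task.explicitlyInvoked ? "lean_work_unit_runner" : "execute_directly";
  }
  // Nothing matched: ask the user instead of inferring from vague keywords.
  return "needs_intake";
}
```

For example, `routeTask({ smallClearEdit: true })` yields `execute_directly`, while the same task with `explicitlyInvoked: true` yields `lean_work_unit_runner`.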

package/docs/troubleshooting.md CHANGED

@@ -4,6 +4,12 @@
 Keep `policy.allow_implicit_invocation: false`. Use explicit `$skill-name` invocation until live routing tests prove trigger precision.
+## Workflow Supervisor is used for a tiny edit
+If Workflow Supervisor was not explicitly invoked and the task has obvious files, obvious acceptance, and no hard supervisor trigger, do not invoke the skill. Execute directly and run the relevant check.
+If the user explicitly invoked `workflow-supervisor`, `$workflow-supervisor`, or said to use the skill, do not silently skip it. Select the lightest valid profile, usually `lean_work_unit_runner` for bounded unit work or `planning_only` when the user only needs sequencing, and explain that direct execution would normally fit a tiny edit.
 ## The agent cannot find the skills
 Run:
@@ -33,10 +39,28 @@ Do not remove work units to make the process lean. If a unit cannot name its bou
 Treat this as a lifecycle bug, not a cosmetic cleanup task. A terminal report or completed notification does not close a native Codex subagent. Record every native worker id in `WORKER-MAP.md`, call the native close action such as `close_agent` after the terminal report or blocker is captured, and block the final outcome if any native worker lacks a close result. Prefer one-shot portable delegation when it satisfies the work.
+## Unsupported gauntlet summaries are used as proof
+Unsupported external gauntlet summaries are not validation evidence. Treat them as raw leads only unless they preserve per-scenario reports, commands, artifacts, and expected outcomes that another maintainer can inspect. Use repo-native tests, fixtures, `npm run validate`, and live adapter probes such as `workflow-supervisor delegate-doctor --agent all --probe --require-pass` for real confidence.
 ## Verification rubber-stamps the result
 Use `$acceptance-matrix` for formal evidence rows. A PASS requires row-by-row evidence or explicit waiver evidence.
+## Outcome evidence is only inferred
+Use row-level `CONDITIONAL_PASS` only when the strongest available checks strongly infer the expected outcome but cannot fully observe it. Record the missing capability, limitation, and required external check. Do not roll that row into a final PASS unless the user explicitly accepts the limitation as a waiver or narrowed scope.
+## Browser snapshots are unavailable
+Browser snapshots are a verifier adapter, not the core verification model. If browser, screenshot, Playwright, Storybook, visual diff, or manual-review capability is unavailable, use the strongest available lower-level observable contract such as jsdom render, API probe, state-machine test, file snapshot, route manifest, or static semantic diff inspection. If the source requirement truly depends on browser or visual proof, mark the row BLOCKED or `CONDITIONAL_PASS` with the limitation.
+## Bug fix passes with only related checks
+A related build, lint, broad test run, or inspection is not enough for a bug fix or risky behavior change unless it would catch the exact symptom. Add a red-capable feedback loop with the command, artifact, UI state, or manual check that would fail before the fix and pass after it.
+If no correct test surface exists, record an architecture or verification finding and either block the row or get explicit substitute-evidence waiver from the user. Do not hide this as a skipped check in a PASS report.
 ## A broad roadmap becomes one giant work unit
 Use the source-requirement coverage gate before work-unit finalization. Every material roadmap item, exit criterion, named integration, and numeric target should be mapped to a unit and acceptance row, explicitly deferred by the user, blocked for a decision, or marked non-material with a reason. Do not accept "future work" or residual risk notes as a substitute for work units.
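
The troubleshooting entries above state that a PASS requires row-by-row evidence and that a row-level `CONDITIONAL_PASS` must not roll into a final PASS without explicit waiver evidence. A minimal sketch of that roll-up rule follows. The function name `canReportFinalPass` and the row fields `verdict` and `waiverEvidence` are hypothetical; the actual `WorkerReportV1` field names are not shown in this part of the diff.

```javascript
// Hypothetical row shape: { verdict: string, waiverEvidence?: string }.
function canReportFinalPass(rows) {
  // No rows means no row-by-row evidence, so there is nothing to pass.
  if (!Array.isArray(rows) || rows.length === 0) return false;
  return rows.every((row) => {
    if (row.verdict === "PASS") return true;
    // An inferred outcome counts only when the user explicitly waived the limitation.
    if (row.verdict === "CONDITIONAL_PASS") return Boolean(row.waiverEvidence);
    // BLOCKED or any other verdict prevents a final PASS.
    return false;
  });
}
```

Under this sketch, a report with one `PASS` row and one unwaived `CONDITIONAL_PASS` row cannot be reported as a final PASS.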

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "workflow-supervisor",
-  "version": "0.1.4",
+  "version": "0.2.0",
   "description": "Portable workflow supervision skills for Codex, Claude Code, and generic agent workspaces.",
   "type": "module",
   "repository": {

package/schemas/dossier-v1.schema.json CHANGED

@@ -114,6 +114,44 @@
     "required_commands_or_evidence": {
       "$ref": "#/$defs/stringList"
     },
+    "feedback_loop": {
+      "type": "object",
+      "required": [
+        "command_or_evidence",
+        "red_capable",
+        "exact_symptom_or_behavior",
+        "deterministic",
+        "expected_runtime",
+        "agent_runnable"
+      ],
+      "additionalProperties": true,
+      "properties": {
+        "command_or_evidence": {
+          "type": "string",
+          "minLength": 1
+        },
+        "red_capable": {
+          "type": "string",
+          "enum": ["yes", "no", "not_applicable"]
+        },
+        "exact_symptom_or_behavior": {
+          "type": "string",
+          "minLength": 1
+        },
+        "deterministic": {
+          "type": "string",
+          "enum": ["yes", "no"]
+        },
+        "expected_runtime": {
+          "type": "string",
+          "minLength": 1
+        },
+        "agent_runnable": {
+          "type": "string",
+          "enum": ["yes", "no"]
+        }
+      }
+    },
     "supervisor_checkpoints": {
       "$ref": "#/$defs/stringList"
     },
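
The `feedback_loop` object added in the hunk above requires six string fields, three of them restricted to an enum, and allows additional properties. The sketch below hand-checks those constraints without a JSON Schema library. The function name `checkFeedbackLoop` and the values in `example` are hypothetical illustrations; only the field names, enum values, and non-empty string rule come from the schema diff.

```javascript
// Fields the schema declares as strings with minLength 1.
const REQUIRED_STRINGS = ["command_or_evidence", "exact_symptom_or_behavior", "expected_runtime"];

// Fields the schema restricts to an enum.
const ENUMS = {
  red_capable: ["yes", "no", "not_applicable"],
  deterministic: ["yes", "no"],
  agent_runnable: ["yes", "no"],
};

// Returns a list of human-readable problems; an empty list means the object conforms.
function checkFeedbackLoop(loop) {
  if (typeof loop !== "object" || loop === null || Array.isArray(loop)) {
    return ["feedback_loop must be an object"];
  }
  const errors = [];
  for (const key of REQUIRED_STRINGS) {
    if (typeof loop[key] !== "string" || loop[key].length < 1) {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  for (const [key, allowed] of Object.entries(ENUMS)) {
    if (!allowed.includes(loop[key])) {
      errors.push(`${key} must be one of: ${allowed.join(", ")}`);
    }
  }
  return errors; // extra keys are fine because additionalProperties is true
}

// Hypothetical values for illustration only.
const example = {
  command_or_evidence: "npm test -- login-redirect",
  red_capable: "yes",
  exact_symptom_or_behavior: "Login redirects to a blank page",
  deterministic: "yes",
  expected_runtime: "under 30 seconds",
  agent_runnable: "yes",
};
```

Calling `checkFeedbackLoop(example)` returns an empty array, while an object with `red_capable: "maybe"` returns one error naming the allowed values.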