@hallucination-studio/harness-engine 1.0.0-beta.10.9ff10d9

Files changed (19)

package/README.md +262 -0
package/bin/install.js +154 -0
package/package.json +31 -0
package/skills/harness-engine/SKILL.md +82 -0
package/skills/harness-engine/agents/openai.yaml +4 -0
package/skills/harness-engine/assets/repo-template/.keep +1 -0
package/skills/harness-engine/assets/sops/.keep +1 -0
package/skills/harness-engine/evals/cases.json +50 -0
package/skills/harness-engine/evals/run_evals.py +1188 -0
package/skills/harness-engine/references/evaluation-loop.md +24 -0
package/skills/harness-engine/references/evidence-first-evals.md +180 -0
package/skills/harness-engine/references/exec-plans.md +51 -0
package/skills/harness-engine/references/file-map.md +17 -0
package/skills/harness-engine/references/knowledge-capture.md +35 -0
package/skills/harness-engine/references/question-catalog.md +29 -0
package/skills/harness-engine/references/sop-index.md +12 -0
package/skills/harness-engine/references/template-policy.md +13 -0
package/skills/harness-engine/references/workflow.md +55 -0
package/skills/harness-engine/scripts/manage_harness.py +2374 -0

package/skills/harness-engine/references/evaluation-loop.md ADDED

@@ -0,0 +1,24 @@
+# Evaluation Loop
+Use this loop when changing the skill, templates, scripts, or policy references:
+1. Draft the behavior in `SKILL.md`, `references/`, templates, or scripts.
+2. Test it with the deterministic commands in `scripts/manage_harness.py`.
+3. Evaluate it with `python3 evals/run_evals.py`.
+4. Read the structured `harness-eval-report.v1` output: aggregate metrics, per-case results,
+   findings, user message, and recommended actions.
+5. Iterate until the runner passes, the score stays at 100, and failed-case output would be
+   actionable for a user if the eval regressed.
+## What The Evals Cover
+- first-time initialization of an empty repository
+- frontend-aware repository analysis
+- execution-plan and knowledge-capture closure
+- quality gates that block closure and force rework when scores fail
+- phase continuity and workstream recovery for resumable work
+- structured eval report output with per-case findings and recommended actions
+- preservation of unmanaged user-owned docs
+- local harness checks that do not require user-project CI
+Add a new eval case whenever a regression would be easy to miss by reading the files manually.
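Steps 4 and 5 of the loop in `evaluation-loop.md` could be checked mechanically along these lines. This is a minimal sketch, not the package's code. It assumes the report is available as a JSON string and borrows its field names (`schema_version`, `status`, `score`, `case_results`, `findings`, `recommended_actions`) from the recommended report fields in `evidence-first-evals.md`. The helper name `gate_on_report` and the sample case ids are hypothetical.

```python
import json


def gate_on_report(report_text: str) -> list[str]:
    """Return a list of problems; an empty list means the loop may stop.

    Hypothetical helper: the real runner's output may differ from the
    recommended field names used here.
    """
    report = json.loads(report_text)
    problems = []
    if report.get("schema_version") != "harness-eval-report.v1":
        problems.append("unexpected schema_version")
    if report.get("status") != "pass":
        problems.append("runner did not pass")
    if report.get("score") != 100:
        problems.append(f"score is {report.get('score')}, expected 100")
    for case in report.get("case_results", []):
        if case.get("status") == "pass":
            continue
        # A failed case is only actionable if it says what broke and what to do next.
        if not case.get("findings") or not case.get("recommended_actions"):
            problems.append(f"case {case.get('id')} failed without actionable output")
        else:
            problems.append(f"case {case.get('id')} failed")
    return problems


# Illustrative report; the case ids are made up for this example.
sample = json.dumps({
    "schema_version": "harness-eval-report.v1",
    "status": "fail",
    "score": 50,
    "case_results": [
        {"id": "case-a", "status": "pass", "findings": [], "recommended_actions": []},
        {"id": "case-b", "status": "fail", "findings": [], "recommended_actions": []},
    ],
})
for problem in gate_on_report(sample):
    print(problem)
```

Returning a list of problems rather than a boolean keeps the failure output readable, which matches the loop's requirement that a regression be actionable for a user.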

package/skills/harness-engine/references/evidence-first-evals.md ADDED

@@ -0,0 +1,180 @@
+# Evidence-First Evals
+Use this reference when a task needs stronger validation than an LLM-written quality estimate.
+The quality score is the final readiness summary, not the eval itself.
+## Core Rule
+Every eval must separate four layers:
+1. **Product contract checks**: machine-readable assertions derived from `product.md`,
+   product specs, acceptance criteria, or the user's prompt.
+2. **Runtime behavior checks**: tests, API smoke checks, CLI checks, browser interactions,
+   and state assertions that prove the implementation works.
+3. **Visual and UX evidence**: screenshots, DOM/accessibility snapshots, responsive viewport
+   checks, and layout invariants for user-facing surfaces.
+4. **Reviewer judgment**: LLM or human scoring only after the first three layers have produced
+   evidence and logged defects.
+If a requirement cannot be checked directly, write down why and replace it with the narrowest
+observable proxy. Do not silently convert it into a vague score.
+## Eval Case Shape
+Model each case like an OpenAI eval sample: stable id, input, expected behavior, recorded events,
+and aggregate metrics.
+Recommended fields:
+- `id`: stable case id, versioned when the case changes materially.
+- `source`: product spec, user request, bug report, design file, or regression source.
+- `risk`: what failure this case is meant to catch.
+- `setup`: fixtures, seed data, feature flags, viewport, network state, or browser route.
+- `actions`: exact commands, API calls, browser actions, or user flows.
+- `assertions`: deterministic checks that must pass.
+- `artifacts`: logs, screenshots, traces, DOM snapshots, accessibility snapshots, or diffs.
+- `defect_policy`: severity and `defect-log` summary to use if the case fails.
+- `metrics`: pass/fail fields and numeric measurements to aggregate.
+Do not accept an eval case whose only assertion is "LLM rates this highly".
+## Product Contract Checks
+Before implementation, extract product requirements into a checklist that can be tested:
+- required capabilities and forbidden capabilities
+- key user workflows and edge cases
+- copy, information architecture, and domain terminology that must appear
+- persistence, permissions, latency, error handling, and empty states
+- explicit non-goals such as "do not add CI" or "do not introduce auth"
+For every product claim in the final answer, there should be a matching command, test, browser
+assertion, artifact, or explicitly documented limitation.
+## Domain Issue Workflows
+Issue triage should be domain-routed before implementation. The generated `AGENTS.md` owns the
+current routing table; use it to decide which durable docs and SOPs to read first.
+Minimum expectations by domain:
+- Product contract: convert requirements, specs, and acceptance criteria into assertions.
+- Frontend/UI: capture browser or local-runtime evidence for the affected workflow and viewport.
+- Backend/runtime: reproduce the behavior narrowly and verify with tests, API smoke checks, logs,
+  or integration evidence.
+- Architecture: document boundary, dependency, data-flow, migration, and compatibility impact.
+- Data/state: verify fixtures, migrations, rollback or compatibility behavior, and data-loss risk.
+- Security/privacy: review sensitive data paths, permissions, auth boundaries, and secret handling.
+- Performance/reliability: collect baseline measurement, repeatable benchmark or smoke evidence,
+  and before/after comparison.
+Confirmed defects or evidence gaps should be logged into the active plan before quality scoring.
+Each `quality-score` dimension must include a concrete evidence note. A numeric score without
+evidence is not a valid readiness signal.
+Use exact evidence when closing knowledge items: the text passed to `knowledge-mark-written`
+must already appear in the durable destination doc. If the destination uses different wording,
+copy a short phrase from that destination into an evidence file and pass `--evidence-file`.
+## Frontend Checks
+For frontend work, use browser evidence instead of relying on a screenshot glance:
+- Open the live route in a browser, not only static file inspection.
+- Capture at least one desktop and one mobile viewport for meaningful UI changes.
+- Assert important text, controls, selected state, loading state, empty state, error state,
+  and primary interaction outcomes from the DOM or accessibility tree.
+- Check layout invariants: no critical overlap, no clipped primary text, stable toolbar/grid
+  dimensions, usable tap targets, and visible focus/selected states.
+- For canvas/WebGL/game UIs, add pixel or scene-state checks so a blank canvas cannot pass.
+- Save screenshots or snapshot paths in the plan or `docs/generated/` when visual evidence
+  matters for later review.
+If the browser tool is unavailable, record the limitation as validation evidence and replace it
+with the strongest available fallback: static DOM checks, component tests, image snapshots, or
+API smoke checks. Do not mark UX as fully validated without saying what was missing.
+## Frontend Issue Reports
+Frontend feedback is an eval trigger even when the harness skill was not explicitly invoked.
+Handle any UI, layout, interaction, responsive behavior, visual state, canvas, or design fidelity
+question through the repository's frontend workflow.
+The correct response is:
+- read `docs/FRONTEND.md`, `docs/DESIGN.md`, and the relevant SOP
+- inspect the affected route, component, viewport, and user workflow
+- reproduce the behavior with browser or local-runtime evidence when possible
+- turn the finding into product/UX assertions or a regression case
+- log confirmed defects or missing evidence in the active plan
+- fix and validate against the same workflow before claiming the UI is acceptable
+Do not answer from memory or aesthetic judgment alone when the question is about a concrete
+frontend behavior.
+## Bug Discovery Evals
+Add regression cases for failures that were previously missed.
+A good bug-discovery eval proves two things:
+- the bad implementation fails a narrow test or observable assertion
+- the harness blocks closure through `defect-log`, `quality-score`, `plan-close`, and `check`
+Track missed-bug classes separately from generic test pass rate. Examples:
+- product-spec drift not detected
+- browser layout defect not detected
+- generated app behavior bug not detected
+- unresolved defect allowed through handoff
+- missing visual evidence accepted as UX validation
+## Metrics
+Record sample-level events first, then aggregate.
+Useful aggregate metrics:
+- `case_pass_rate`: passed cases divided by total cases
+- `product_contract_pass_rate`: product assertions passed divided by product assertions
+- `visual_evidence_coverage`: frontend cases with required screenshots/snapshots
+- `defect_block_rate`: known defects that blocked closure when injected
+- `missed_defect_count`: known defects that reached a passing quality gate
+- `artifact_completeness`: required logs/screenshots/traces present
+- `llm_judge_agreement`: optional reviewer score agreement with labeled cases
+Fail release or handoff when a P0/P1 defect is missed, required product assertions are untested,
+or frontend evidence is absent for meaningful UI work.
+## Report Output
+Eval runners should emit structured JSON that can be shown to users and consumed by tools.
+Use a stable schema name and include both aggregate and per-case results.
+Recommended top-level fields:
+- `schema_version`: stable report schema such as `harness-eval-report.v1`.
+- `status`: `pass` or `fail`.
+- `score`: whole-number aggregate score from `0` to `100`.
+- `summary`: passed, failed, total, and one concise message.
+- `metrics`: named aggregate metrics, not only one score.
+- `case_results`: one object per case with `id`, `description`, `status`, `score`,
+  `duration_seconds`, `findings`, and `recommended_actions`.
+- `user_message`: direct text the agent can relay to the user.
+- `recommended_actions`: deduplicated next actions for failed cases.
+Failure output must name the specific failed case, failed assertion or evidence gap, and the next
+action. Passing output should still include per-case scores so the user can see what was actually
+covered.
+## Meta-Eval Calibration
+When an LLM judge is used, keep a small labeled meta-eval set:
+- examples that should pass
+- examples that should fail product correctness
+- examples that should fail visual/UX evidence
+- examples with open defects that must block handoff
+Run the judge against these labels and treat disagreement as an eval bug. The judge may summarize
+evidence and suggest risks, but it must not override deterministic failures.
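The Metrics section of `evidence-first-evals.md` asks for sample-level events to be recorded first and aggregated afterwards. A minimal sketch of that aggregation follows, assuming a hypothetical `CaseEvent` record. Its field names are illustrative and are not taken from `run_evals.py`. Reporting a `defect_block_rate` of 1.0 when no defects were injected is likewise an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class CaseEvent:
    """One sample-level record; hypothetical shape for illustration only."""
    case_id: str
    passed: bool
    product_assertions_passed: int
    product_assertions_total: int
    injected_defect: bool = False  # a known defect was planted in this case
    gate_passed: bool = False      # the quality gate let the case through


def aggregate(events: list[CaseEvent]) -> dict[str, float]:
    """Compute a few of the aggregate metrics from sample-level events."""
    total = len(events)
    assertions_total = sum(e.product_assertions_total for e in events)
    injected = [e for e in events if e.injected_defect]
    # A missed defect is an injected defect that still reached a passing gate.
    missed = sum(1 for e in injected if e.gate_passed)
    return {
        "case_pass_rate": sum(e.passed for e in events) / total if total else 0.0,
        "product_contract_pass_rate": (
            sum(e.product_assertions_passed for e in events) / assertions_total
            if assertions_total else 0.0
        ),
        "defect_block_rate": (
            (len(injected) - missed) / len(injected) if injected else 1.0
        ),
        "missed_defect_count": missed,
    }


events = [
    CaseEvent("a", True, 3, 3),
    CaseEvent("b", False, 1, 2),
    CaseEvent("c", True, 2, 3, injected_defect=True, gate_passed=False),
    CaseEvent("d", False, 2, 2, injected_defect=True, gate_passed=True),
]
print(aggregate(events))
```

Keeping `missed_defect_count` separate from `case_pass_rate` reflects the document's advice to track missed-bug classes apart from the generic pass rate.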

package/skills/harness-engine/references/exec-plans.md ADDED

@@ -0,0 +1,51 @@
+# Execution Plans
+Execution plans are required for multi-step work, risky changes, or tasks that need coordination across files.
+## When To Create One
+- more than one implementation step is required
+- validation is non-trivial
+- architecture, product, reliability, or security decisions are involved
+- work will span enough time that another agent may resume it later
+## Location
+- Workstream recovery ledger: `docs/exec-plans/workstreams.md`
+- Active: `docs/exec-plans/active/`
+- Completed: `docs/exec-plans/completed/`
+## Minimum Sections
+- goal
+- scope
+- constraints
+- steps
+- validation
+- quality gate
+- defects to resolve
+- rework required
+- phase continuity
+- durable knowledge to capture
+- completion notes
+## Operating Rule
+Update the active plan during the work. When the work is done, score it, complete any required rework, record phase continuity for resumable work, move it to `completed`, and leave behind any durable facts in the right permanent docs.
+Before scoring or closing, replace generic starter text with task-specific content. Do not leave placeholders such as "Define in-scope work", "Add the first concrete step", or "Describe how the work will be verified". The default unused durable-knowledge line may remain open, but any real knowledge TODO must be logged, written, and marked complete.
+## Closed Loop
+Use the script, not ad hoc manual edits, for the lifecycle:
+- `plan-start`: create a new active execution plan
+- `knowledge-log`: append a durable fact that still needs to be written into permanent docs and return its stable id; use `--fact-file` for shell-sensitive facts
+- `knowledge-mark-written`: verify and mark a logged fact as written into its permanent doc; evidence must be exact text already present in the destination doc; prefer `--id <knowledge-id> --evidence-file <file>` for shell-sensitive evidence, and use `--append` only to append the exact fact first
+- `defect-log`: record a bug found by validation, evals, browser testing, or code review; this forces the quality gate to fail and makes the defect the next rework input
+- `defect-resolve`: mark a logged defect fixed with validation or code evidence; re-run validation and `quality-score` before closing
+- `quality-score`: write a scored quality gate into the plan; every dimension must include an evidence note; if it fails, the generated `## Rework Required` section becomes the next implementation input
+- `phase-set`: declare whether phased or resumable work continues, pauses, stops, or completes
+- `workstream-upsert`: update `docs/exec-plans/workstreams.md` so interrupted work can be recovered without chat history
+- `plan-close`: refuse to close cleanly until the quality gate passes, phase continuity is recorded, and the listed knowledge items are marked as written to durable docs
+- `check`: run a local handoff check without requiring target-repo CI
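The `plan-close` conditions described in `exec-plans.md` can be read as a simple gate. A hedged sketch follows. The `PlanState` structure, the `close_blockers` function, and the ids `K1` and `K2` are hypothetical, and nothing here reflects how `manage_harness.py` actually stores or checks plan state.

```python
from dataclasses import dataclass, field


@dataclass
class PlanState:
    """Hypothetical in-memory view of an active execution plan."""
    quality_gate_passed: bool = False
    phase_recorded: bool = False
    open_defects: list[str] = field(default_factory=list)
    # knowledge id -> True once the fact is marked written to a durable doc
    knowledge_written: dict[str, bool] = field(default_factory=dict)


def close_blockers(plan: PlanState) -> list[str]:
    """List the reasons a plan cannot be closed cleanly; empty means it can."""
    blockers = []
    if plan.open_defects:
        blockers.append(f"unresolved defects: {', '.join(plan.open_defects)}")
    # A logged defect forces the quality gate to fail until it is resolved.
    if not plan.quality_gate_passed or plan.open_defects:
        blockers.append("quality gate has not passed")
    if not plan.phase_recorded:
        blockers.append("phase continuity is not recorded")
    pending = [kid for kid, written in plan.knowledge_written.items() if not written]
    if pending:
        blockers.append(f"knowledge not marked written: {', '.join(pending)}")
    return blockers


plan = PlanState(
    quality_gate_passed=True,
    phase_recorded=True,
    knowledge_written={"K1": True, "K2": False},
)
print(close_blockers(plan))  # ['knowledge not marked written: K2']
```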

package/skills/harness-engine/references/file-map.md ADDED

@@ -0,0 +1,17 @@
+# File Map
+- `AGENTS.md`: short router, reading order, repo-specific guardrails
+- `ARCHITECTURE.md`: domain boundaries, runtime topology, integration seams
+- `docs/PLANS.md`: plan lifecycle and storage rules
+- `docs/PRODUCT_SENSE.md`: product heuristics and tradeoff rules
+- `docs/QUALITY_SCORE.md`: quality rubric by domain and layer
+- `docs/RELIABILITY.md`: SLOs, failure modes, observability expectations
+- `docs/SECURITY.md`: security constraints, secrets, auth, data handling
+- `docs/DESIGN.md`: design principles and review heuristics
+- `docs/FRONTEND.md`: frontend stack conventions and validation loop
+- `docs/design-docs/`: durable design decisions
+- `docs/product-specs/`: durable product specs
+- `docs/exec-plans/`: active plans, completed plans, and tech debt tracker
+- `docs/sops/`: mechanical procedures for recurring workflows and validation loops
+- `docs/generated/`: generated evidence and facts such as schemas, browser screenshots, DOM snapshots, layout summaries, and smoke outputs; use `evidence-prune` to preview stale unreferenced artifacts before deleting
+- `docs/references/`: external references rewritten or linked for model-friendly discovery

package/skills/harness-engine/references/knowledge-capture.md ADDED

@@ -0,0 +1,35 @@
+# Knowledge Capture
+Write durable knowledge into the repository whenever one of these is true:
+- the fact changed your implementation plan
+- the fact would likely be needed by another agent later
+- the fact came from a human answer rather than directly from code
+- the fact explains why a policy, architecture choice, or validation loop exists
+- the fact would be annoying to rediscover from scratch
+## Where To Write It
+- Product behavior or workflow intent: `docs/product-specs/`
+- Design rationale or UX rules: `docs/design-docs/`
+- Runtime validation, incidents, or observability loops: `docs/RELIABILITY.md` or `docs/sops/`
+- Security constraints or review gates: `docs/SECURITY.md`
+- Architecture boundaries or integration seams: `ARCHITECTURE.md`
+- Reusable external material: `docs/references/`
+## Minimum Rule
+If a useful fact would otherwise live only in chat, move it into the repo before closing the task.
+## Closed Loop
+Prefer the script workflow:
+1. Log the fact into the active execution plan with `knowledge-log`.
+2. Write the fact into its permanent destination doc.
+3. Mark the plan item complete with `knowledge-mark-written --id <knowledge-id> --evidence "<verbatim text present in durable doc>"`.
+4. Close the plan with `plan-close`.
+`knowledge-log` returns a stable id. Prefer id-based closure so permanent docs can use concise, natural wording rather than duplicating the exact plan fact.
+`knowledge-mark-written` verifies that the destination file contains either the provided evidence text or the exact fact. Evidence must be copied from the destination doc; a summary such as "the doc now states this rule" is rejected unless that exact sentence is in the doc. Use `--append` only when the exact fact should be appended to the destination doc by the tool.
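The exact-text rule for `knowledge-mark-written` amounts to a verbatim substring check against the destination doc. A minimal sketch under that assumption is shown below. The function name `evidence_is_present` and the sample doc content are hypothetical, and whether the real tool normalizes whitespace or case is not stated in the source.

```python
import tempfile
from pathlib import Path


def evidence_is_present(destination: Path, evidence: str) -> bool:
    """True only if the evidence text appears verbatim in the destination doc."""
    text = evidence.strip()
    if not text:
        return False  # empty evidence proves nothing
    return text in destination.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "SECURITY.md"
    doc.write_text(
        "Secrets are loaded from the environment, never from the repo.\n",
        encoding="utf-8",
    )
    # A phrase copied from the doc is accepted.
    print(evidence_is_present(doc, "never from the repo"))           # True
    # A summary of the doc is rejected because it is not in the doc.
    print(evidence_is_present(doc, "the doc now states this rule"))  # False
```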

package/skills/harness-engine/references/question-catalog.md ADDED

@@ -0,0 +1,29 @@
+# Question Catalog
+Use these prompts only when the repo analysis cannot answer them.
+## Product
+- What core user outcome does this repository serve?
+- Which flows matter enough to deserve explicit product specs first?
+- Which non-goals should the harness make visible?
+## Reliability
+- What failure is unacceptable in production?
+- What recovery time or uptime expectation matters most?
+- Which runtime environments must be validated locally before merge?
+## Security
+- Does the repo handle credentials, customer data, regulated data, or privileged actions?
+- Are there required review gates for authentication, authorization, or secrets handling?
+## Frontend
+- Is the product expected to have a polished user-facing interface, an internal tool UI, or no frontend?
+- Which browsers, devices, or accessibility expectations are non-negotiable?
+## References
+- Which external docs are worth copying into `docs/references/` because the team uses them repeatedly?

package/skills/harness-engine/references/sop-index.md ADDED

@@ -0,0 +1,12 @@
+# SOP Index
+Choose an SOP whenever the task touches one of these areas:
+- architecture or layering changes: `docs/sops/layered-domain-architecture-setup.md`
+- missing durable repository knowledge: `docs/sops/encode-unseen-knowledge.md`
+- runtime debugging or observability setup: `docs/sops/local-observability-feedback-loop.md`
+- user interface work: `docs/sops/chrome-devtools-ui-validation-loop.md`
+- product correctness, frontend layout, or bug-discovery evals: `docs/sops/evidence-first-eval-loop.md`
+- backend behavior, architecture boundaries, data/state, security, or performance issue triage: start from the Issue Workflows in `AGENTS.md`, then follow the domain docs listed there
+If no SOP exists for a recurring workflow, create one in `docs/sops/` as part of the task.

package/skills/harness-engine/references/template-policy.md ADDED

@@ -0,0 +1,13 @@
+# Template Policy
+Every generated file starts with a managed marker:
+`<!-- harness-engine:managed -->`
+Init behavior:
+- `init`: create missing files for new repositories; when an existing managed harness is detected, refresh managed files and create missing files while preserving unmanaged files
+Use `init` as the normal workspace command so creation and reconciliation share one path. Use `--force` only when the human explicitly accepts overwriting.
+If a file exists without the managed marker, treat it as user-owned unless the human explicitly asks to replace it.
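The ownership rule in `template-policy.md` can be sketched as a small decision function. This is an illustration only. The names `is_managed` and `should_write` are hypothetical, and the sketch assumes the marker sits at the very start of the file content, which the real `init` command may check more leniently.

```python
from typing import Optional

MANAGED_MARKER = "<!-- harness-engine:managed -->"


def is_managed(content: str) -> bool:
    """A file counts as managed when it starts with the managed marker."""
    return content.startswith(MANAGED_MARKER)


def should_write(existing: Optional[str], force: bool = False) -> bool:
    """Decide whether init may write a generated file at this path.

    existing is the current file content, or None if the file is missing.
    """
    if existing is None:
        return True   # missing files are created
    if is_managed(existing):
        return True   # managed files are refreshed
    return force      # user-owned files are kept unless force is accepted


print(should_write(None))                                   # True
print(should_write(MANAGED_MARKER + "\n# Generated doc"))   # True
print(should_write("# My own notes"))                       # False
print(should_write("# My own notes", force=True))           # True
```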

package/skills/harness-engine/references/workflow.md ADDED

@@ -0,0 +1,55 @@
+# Workflow
+Use this skill in two passes.
+## Pass 1: Analyze and Confirm
+Run `analyze` before editing repository docs.
+Ask the human only about facts that cannot be derived safely from the repo, especially:
+- product domain and top-level outcomes
+- intended users or operators
+- production reliability expectations
+- security or compliance constraints
+- frontend experience bar
+- canonical external references worth pinning inside `docs/references/`
+Do not ask for facts that can be inferred from source layout, dependency manifests, or existing docs.
+Also inspect the analysis for:
+- missing durable knowledge that should be written during the task
+- missing execution-plan state
+- which SOPs should be referenced in the generated router docs
+## Pass 2: Init
+Run `sample-answers`, fill the answers, then run `init`.
+Use `init` for both first-time adoption and managed-harness reconciliation. It creates a new harness when none exists, and refreshes managed harness files plus backfills newly introduced managed files when an existing managed harness is detected. Unmanaged user files are preserved unless `--force` is explicitly used.
+After the script runs, read the generated docs once and tighten weak generic phrases before handing off.
+## Ongoing Use
+After the scaffold exists:
+- read `docs/exec-plans/workstreams.md` before resuming interrupted or long-running work
+- create an execution plan before multi-step work
+- use `plan-start` instead of creating plan files manually when possible
+- log durable facts during execution instead of waiting until the end
+- follow the matching SOP for architecture, UI, observability, or knowledge capture work
+- route product, frontend, backend, architecture, data/state, security, performance, and reliability questions through the Issue Workflows in `AGENTS.md`, even when the user did not invoke the harness skill by name
+- encode durable knowledge back into the repository before closing the task
+- mark logged knowledge items as written after updating the permanent docs; the `knowledge-mark-written` evidence must be exact text already present in the destination doc, not a paraphrase
+- log every defect found by tests, evals, browser validation, or code review with `defect-log`
+- resolve logged defects only after fixing the implementation and citing passing validation with `defect-resolve`
+- run `quality-score` after implementation and validation, with evidence notes for every dimension
+- if `quality-score` fails, implement the `## Rework Required` items and score again
+- use `phase-set` and `workstream-upsert` when a plan belongs to phased or resumable work
+- use `plan-close` to verify no durable knowledge is left stranded in the active plan
+- before `plan-close`, replace generic plan placeholders with task-specific scope, constraints, steps, validation, and completion notes; delete unused ad hoc durable-knowledge TODOs
+- run `.codex/skills/harness-engine/scripts/manage_harness.py check --repo <target-repo>` before handoff
+- preview stale generated evidence with `evidence-prune` when `docs/generated/` contains old screenshots, DOM dumps, layout summaries, or smoke outputs; review the dry-run output before using `--apply`
+- do not add CI to the target repository unless the human explicitly asks for it