@hallucination-studio/harness-engine 1.0.0-beta.10.9ff10d9

@@ -0,0 +1,24 @@
1
+ # Evaluation Loop
2
+
3
+ Use this loop when changing the skill, templates, scripts, or policy references:
4
+
5
+ 1. Draft the behavior in `SKILL.md`, `references/`, templates, or scripts.
6
+ 2. Test it with the deterministic commands in `scripts/manage_harness.py`.
7
+ 3. Evaluate it with `python3 evals/run_evals.py`.
8
+ 4. Read the structured `harness-eval-report.v1` output: aggregate metrics, per-case results,
9
+ findings, user message, and recommended actions.
10
+ 5. Iterate until the runner passes, the score stays at 100, and the output for a failed case
11
+ would give a user actionable next steps if the eval regressed.
12
+
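Steps 4 and 5 can be sketched as a small gate over the report. This is an illustrative sketch, not the runner's own code: it assumes the report is a JSON object with `status`, `score`, and `case_results` fields as described for `harness-eval-report.v1`, and `reasons_to_keep_iterating` is a hypothetical helper name.

```python
from __future__ import annotations


def reasons_to_keep_iterating(report: dict) -> list[str]:
    """Return the reasons the loop is not finished; an empty list means done."""
    reasons = []
    if report.get("status") != "pass":
        reasons.append("runner did not pass")
    if report.get("score") != 100:
        reasons.append(f"score is {report.get('score')}, expected 100")
    # A failed case is only useful to a user if it says what to do next.
    for case in report.get("case_results", []):
        if case.get("status") == "fail" and not case.get("recommended_actions"):
            reasons.append(f"case {case.get('id')} failed without recommended actions")
    return reasons


if __name__ == "__main__":
    # Hypothetical report used only to exercise the helper.
    regressed = {
        "status": "fail",
        "score": 80,
        "case_results": [{"id": "init-empty-repo", "status": "fail", "recommended_actions": []}],
    }
    for reason in reasons_to_keep_iterating(regressed):
        print(reason)
```

An empty list is the stopping condition; anything else names what still needs work.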
13
+ ## What The Evals Cover
14
+
15
+ - first-time initialization of an empty repository
16
+ - frontend-aware repository analysis
17
+ - execution-plan and knowledge-capture closure
18
+ - quality gates that block closure and force rework when scores fail
19
+ - phase continuity and workstream recovery for resumable work
20
+ - structured eval report output with per-case findings and recommended actions
21
+ - preservation of unmanaged user-owned docs
22
+ - local harness checks that do not require user-project CI
23
+
24
+ Add a new eval case whenever a regression would be easy to miss by reading the files manually.
@@ -0,0 +1,180 @@
1
+ # Evidence-First Evals
2
+
3
+ Use this reference when a task needs stronger validation than an LLM-written quality estimate.
4
+ The quality score is the final readiness summary, not the eval itself.
5
+
6
+ ## Core Rule
7
+
8
+ Every eval must separate four layers:
9
+
10
+ 1. **Product contract checks**: machine-readable assertions derived from `product.md`,
11
+ product specs, acceptance criteria, or the user's prompt.
12
+ 2. **Runtime behavior checks**: tests, API smoke checks, CLI checks, browser interactions,
13
+ and state assertions that prove the implementation works.
14
+ 3. **Visual and UX evidence**: screenshots, DOM/accessibility snapshots, responsive viewport
15
+ checks, and layout invariants for user-facing surfaces.
16
+ 4. **Reviewer judgment**: LLM or human scoring only after the first three layers have produced
17
+ evidence and logged defects.
18
+
19
+ If a requirement cannot be checked directly, write down why and replace it with the narrowest
20
+ observable proxy. Do not silently convert it into a vague score.
21
+
22
+ ## Eval Case Shape
23
+
24
+ Model each case like an OpenAI eval sample: stable id, input, expected behavior, recorded events,
25
+ and aggregate metrics.
26
+
27
+ Recommended fields:
28
+
29
+ - `id`: stable case id, versioned when the case changes materially.
30
+ - `source`: product spec, user request, bug report, design file, or regression source.
31
+ - `risk`: what failure this case is meant to catch.
32
+ - `setup`: fixtures, seed data, feature flags, viewport, network state, or browser route.
33
+ - `actions`: exact commands, API calls, browser actions, or user flows.
34
+ - `assertions`: deterministic checks that must pass.
35
+ - `artifacts`: logs, screenshots, traces, DOM snapshots, accessibility snapshots, or diffs.
36
+ - `defect_policy`: severity and `defect-log` summary to use if the case fails.
37
+ - `metrics`: pass/fail fields and numeric measurements to aggregate.
38
+
39
+ Do not accept an eval case whose only assertion is "LLM rates this highly".
40
+
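The recommended fields map naturally onto a small record type. The sketch below is one possible shape, not a schema the harness defines: the `EvalCase` class, the `kind` key on each assertion, and the `case_problems` helper are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EvalCase:
    id: str
    source: str
    risk: str
    setup: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)
    # Each assertion is a dict; the "kind" key is a hypothetical convention.
    assertions: list[dict] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)
    defect_policy: str = ""
    metrics: dict = field(default_factory=dict)


def case_problems(case: EvalCase) -> list[str]:
    """Reject cases that cannot fail deterministically."""
    problems = []
    deterministic = [a for a in case.assertions if a.get("kind") != "llm_judge"]
    if not deterministic:
        problems.append("no deterministic assertion; an LLM rating alone is not an eval")
    if not case.actions:
        problems.append("no actions to execute")
    return problems


if __name__ == "__main__":
    weak = EvalCase(
        id="checkout-v1",
        source="user request",
        risk="checkout total is wrong",
        actions=["open /checkout"],
        assertions=[{"kind": "llm_judge", "text": "LLM rates this highly"}],
    )
    print(case_problems(weak))
```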
41
+ ## Product Contract Checks
42
+
43
+ Before implementation, extract product requirements into a checklist that can be tested:
44
+
45
+ - required capabilities and forbidden capabilities
46
+ - key user workflows and edge cases
47
+ - copy, information architecture, and domain terminology that must appear
48
+ - persistence, permissions, latency, error handling, and empty states
49
+ - explicit non-goals such as "do not add CI" or "do not introduce auth"
50
+
51
+ For every product claim in the final answer, there should be a matching command, test, browser
52
+ assertion, artifact, or explicitly documented limitation.
53
+
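One way to enforce the claim-to-evidence rule is to keep a mapping from each product claim to its evidence and flag the empty entries. This is a minimal sketch under that assumption; the `unbacked_claims` helper and the sample claims are hypothetical.

```python
from __future__ import annotations


def unbacked_claims(claims: dict[str, list[str]]) -> list[str]:
    """Return every claim that has no command, test, artifact, or documented limitation."""
    return [claim for claim, evidence in claims.items() if not evidence]


if __name__ == "__main__":
    # Hypothetical claims from a final answer, each mapped to its evidence.
    claims = {
        "empty state shows a prompt to add the first item": ["browser assertion on the empty list route"],
        "no auth was introduced": ["limitation: verified by code review only"],
        "saves persist across reloads": [],
    }
    print(unbacked_claims(claims))
```

A documented limitation counts as evidence here, which matches the rule that a gap must be stated rather than hidden.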
54
+ ## Domain Issue Workflows
55
+
56
+ Issue triage should be domain-routed before implementation. The generated `AGENTS.md` owns the
57
+ current routing table; use it to decide which durable docs and SOPs to read first.
58
+
59
+ Minimum expectations by domain:
60
+
61
+ - Product contract: convert requirements, specs, and acceptance criteria into assertions.
62
+ - Frontend/UI: capture browser or local-runtime evidence for the affected workflow and viewport.
63
+ - Backend/runtime: reproduce the behavior narrowly and verify with tests, API smoke checks, logs,
64
+ or integration evidence.
65
+ - Architecture: document boundary, dependency, data-flow, migration, and compatibility impact.
66
+ - Data/state: verify fixtures, migrations, rollback or compatibility behavior, and data-loss risk.
67
+ - Security/privacy: review sensitive data paths, permissions, auth boundaries, and secret handling.
68
+ - Performance/reliability: collect baseline measurement, repeatable benchmark or smoke evidence,
69
+ and before/after comparison.
70
+
71
+ Confirmed defects or evidence gaps should be logged into the active plan before quality scoring.
72
+ Each `quality-score` dimension must include a concrete evidence note. A numeric score without
73
+ evidence is not a valid readiness signal.
74
+
75
+ Use exact evidence when closing knowledge items: the text passed to `knowledge-mark-written`
76
+ must already appear in the durable destination doc. If the destination uses different wording,
77
+ copy a short phrase from that destination into an evidence file and pass `--evidence-file`.
78
+
79
+ ## Frontend Checks
80
+
81
+ For frontend work, use browser evidence instead of a quick glance at a screenshot:
82
+
83
+ - Open the live route in a browser; do not rely on static file inspection alone.
84
+ - Capture at least one desktop and one mobile viewport for meaningful UI changes.
85
+ - Assert important text, controls, selected state, loading state, empty state, error state,
86
+ and primary interaction outcomes from the DOM or accessibility tree.
87
+ - Check layout invariants: no critical overlap, no clipped primary text, stable toolbar/grid
88
+ dimensions, usable tap targets, and visible focus/selected states.
89
+ - For canvas/WebGL/game UIs, add pixel or scene-state checks so a blank canvas cannot pass.
90
+ - Save screenshots or snapshot paths in the plan or `docs/generated/` when visual evidence
91
+ matters for later review.
92
+
93
+ If the browser tool is unavailable, record the limitation as validation evidence and substitute
94
+ the strongest available fallback: static DOM checks, component tests, image snapshots, or
95
+ API smoke checks. Do not mark UX as fully validated without saying what was missing.
96
+
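Two of the checks above reduce to simple geometry and pixel tests. The sketch below assumes element boxes are available as `(x, y, width, height)` tuples and canvas pixels as a flat list of RGB tuples, for example from a DOM snapshot and a decoded screenshot; both helper names are hypothetical.

```python
def rects_overlap(a, b):
    """True when two (x, y, width, height) boxes share any interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def canvas_is_blank(pixels, min_distinct_colors=2):
    """Treat a canvas as blank when it contains fewer than min_distinct_colors colors."""
    return len(set(pixels)) < min_distinct_colors


if __name__ == "__main__":
    toolbar = (0, 0, 320, 48)
    title = (16, 40, 200, 24)  # starts 8px above the toolbar's bottom edge
    print(rects_overlap(toolbar, title))
    print(canvas_is_blank([(0, 0, 0)] * 16))
```

A distinct-color count is a deliberately weak proxy; a scene-state assertion is stronger when the application exposes one.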
97
+ ## Frontend Issue Reports
98
+
99
+ Frontend feedback is an eval trigger even when the harness skill was not explicitly invoked.
100
+ Handle any UI, layout, interaction, responsive behavior, visual state, canvas, or design fidelity
101
+ question through the repository's frontend workflow.
102
+
103
+ The correct response is:
104
+
105
+ - read `docs/FRONTEND.md`, `docs/DESIGN.md`, and the relevant SOP
106
+ - inspect the affected route, component, viewport, and user workflow
107
+ - reproduce the behavior with browser or local-runtime evidence when possible
108
+ - turn the finding into product/UX assertions or a regression case
109
+ - log confirmed defects or missing evidence in the active plan
110
+ - fix and validate against the same workflow before claiming the UI is acceptable
111
+
112
+ Do not answer from memory or aesthetic judgment alone when the question is about a concrete
113
+ frontend behavior.
114
+
115
+ ## Bug Discovery Evals
116
+
117
+ Add regression cases for failures that were previously missed.
118
+
119
+ A good bug-discovery eval proves two things:
120
+
121
+ - the bad implementation fails a narrow test or observable assertion
122
+ - the harness blocks closure through `defect-log`, `quality-score`, `plan-close`, and `check`
123
+
124
+ Track missed-bug classes separately from generic test pass rate. Examples:
125
+
126
+ - product-spec drift not detected
127
+ - browser layout defect not detected
128
+ - generated app behavior bug not detected
129
+ - unresolved defect allowed through handoff
130
+ - missing visual evidence accepted as UX validation
131
+
132
+ ## Metrics
133
+
134
+ Record sample-level events first, then aggregate.
135
+
136
+ Useful aggregate metrics:
137
+
138
+ - `case_pass_rate`: passed cases divided by total cases
139
+ - `product_contract_pass_rate`: product assertions passed divided by product assertions
140
+ - `visual_evidence_coverage`: frontend cases with required screenshots/snapshots
141
+ - `defect_block_rate`: injected known defects that blocked closure divided by all injected known defects
142
+ - `missed_defect_count`: known defects that reached a passing quality gate
143
+ - `artifact_completeness`: required logs/screenshots/traces present
144
+ - `llm_judge_agreement`: optional reviewer score agreement with labeled cases
145
+
146
+ Fail release or handoff when a P0/P1 defect is missed, required product assertions are untested,
147
+ or frontend evidence is absent for meaningful UI work.
148
+
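Aggregating from sample-level events might look like the sketch below. The per-case keys (`passed`, `product_assertions_total`, `product_assertions_passed`, `injected_defect`, `closure_blocked`) are assumptions chosen for the example, and counting an injected defect that did not block closure as a missed defect is one interpretation of `missed_defect_count`.

```python
from __future__ import annotations


def ratio(numerator: int, denominator: int) -> float:
    """Divide safely; an empty denominator yields 0.0 rather than an error."""
    return numerator / denominator if denominator else 0.0


def aggregate(cases: list[dict]) -> dict:
    passed = sum(1 for c in cases if c["passed"])
    asserts_total = sum(c["product_assertions_total"] for c in cases)
    asserts_passed = sum(c["product_assertions_passed"] for c in cases)
    injected = [c for c in cases if c.get("injected_defect")]
    blocked = sum(1 for c in injected if c.get("closure_blocked"))
    return {
        "case_pass_rate": ratio(passed, len(cases)),
        "product_contract_pass_rate": ratio(asserts_passed, asserts_total),
        "defect_block_rate": ratio(blocked, len(injected)),
        "missed_defect_count": len(injected) - blocked,
    }


if __name__ == "__main__":
    # Hypothetical sample-level events.
    cases = [
        {"passed": True, "product_assertions_total": 4, "product_assertions_passed": 4,
         "injected_defect": True, "closure_blocked": True},
        {"passed": False, "product_assertions_total": 4, "product_assertions_passed": 2,
         "injected_defect": True, "closure_blocked": False},
    ]
    print(aggregate(cases))
```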
149
+ ## Report Output
150
+
151
+ Eval runners should emit structured JSON that can be shown to users and consumed by tools.
152
+ Use a stable schema name and include both aggregate and per-case results.
153
+
154
+ Recommended top-level fields:
155
+
156
+ - `schema_version`: stable report schema such as `harness-eval-report.v1`.
157
+ - `status`: `pass` or `fail`.
158
+ - `score`: whole-number aggregate score from `0` to `100`.
159
+ - `summary`: passed, failed, total, and one concise message.
160
+ - `metrics`: named aggregate metrics, not only one score.
161
+ - `case_results`: one object per case with `id`, `description`, `status`, `score`,
162
+ `duration_seconds`, `findings`, and `recommended_actions`.
163
+ - `user_message`: direct text the agent can relay to the user.
164
+ - `recommended_actions`: deduplicated next actions for failed cases.
165
+
166
+ Failure output must name the specific failed case, failed assertion or evidence gap, and the next
167
+ action. Passing output should still include per-case scores so the user can see what was actually
168
+ covered.
169
+
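A runner could assemble the recommended fields roughly as follows. This is a sketch, not the shipped runner: averaging per-case scores into the top-level `score` and the wording of `user_message` are assumptions, and `build_report` is a hypothetical name.

```python
from __future__ import annotations

import json


def build_report(case_results: list[dict]) -> dict:
    total = len(case_results)
    failed = [c for c in case_results if c["status"] == "fail"]
    # Assumption: the aggregate score is the rounded mean of per-case scores.
    score = round(sum(c["score"] for c in case_results) / total) if total else 0
    # dict.fromkeys deduplicates while preserving first-seen order.
    actions = list(dict.fromkeys(a for c in failed for a in c["recommended_actions"]))
    if failed:
        message = f"{len(failed)} of {total} eval cases failed: " + ", ".join(c["id"] for c in failed)
    else:
        message = f"All {total} eval cases passed."
    return {
        "schema_version": "harness-eval-report.v1",
        "status": "fail" if failed else "pass",
        "score": score,
        "summary": {"passed": total - len(failed), "failed": len(failed),
                    "total": total, "message": message},
        "metrics": {"case_pass_rate": (total - len(failed)) / total if total else 0.0},
        "case_results": case_results,
        "user_message": message,
        "recommended_actions": actions,
    }


if __name__ == "__main__":
    # Hypothetical per-case results.
    results = [
        {"id": "init-empty-repo", "description": "first-time init", "status": "pass",
         "score": 100, "duration_seconds": 0.4, "findings": [], "recommended_actions": []},
        {"id": "quality-gate-blocks", "description": "gate blocks closure", "status": "fail",
         "score": 0, "duration_seconds": 0.6, "findings": ["plan closed with open defect"],
         "recommended_actions": ["re-run quality-score", "re-run quality-score"]},
    ]
    print(json.dumps(build_report(results), indent=2))
```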
170
+ ## Meta-Eval Calibration
171
+
172
+ When an LLM judge is used, keep a small labeled meta-eval set:
173
+
174
+ - examples that should pass
175
+ - examples that should fail product correctness
176
+ - examples that should fail visual/UX evidence
177
+ - examples with open defects that must block handoff
178
+
179
+ Run the judge against these labels and treat disagreement as an eval bug. The judge may summarize
180
+ evidence and suggest risks, but it must not override deterministic failures.
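
The calibration rule can be expressed as two small functions. The sketch assumes labels and judge verdicts are plain `"pass"`/`"fail"` strings keyed by case id; `disagreements` and `final_verdict` are hypothetical names, and letting the judge's verdict stand once deterministic checks pass is an assumption about how the two signals combine.

```python
from __future__ import annotations


def disagreements(labels: dict[str, str], judge_verdicts: dict[str, str]) -> list[str]:
    """Case ids where the judge's verdict differs from the human label."""
    return [cid for cid, expected in labels.items() if judge_verdicts.get(cid) != expected]


def final_verdict(deterministic_passed: bool, judge_verdict: str) -> str:
    """A deterministic failure always wins; the judge cannot upgrade it."""
    if not deterministic_passed:
        return "fail"
    return judge_verdict


if __name__ == "__main__":
    # Hypothetical labeled meta-eval set and judge output.
    labels = {"should-pass": "pass", "open-defect": "fail"}
    judge = {"should-pass": "pass", "open-defect": "pass"}
    print(disagreements(labels, judge))
    print(final_verdict(False, "pass"))
```

Any id returned by `disagreements` is a bug in the eval setup to investigate, not a reason to relabel the case.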
@@ -0,0 +1,51 @@
1
+ # Execution Plans
2
+
3
+ Execution plans are required for multi-step work, risky changes, or tasks that need coordination across files.
4
+
5
+ ## When To Create One
6
+
7
+ - more than one implementation step is required
8
+ - validation is non-trivial
9
+ - architecture, product, reliability, or security decisions are involved
10
+ - work will span enough time that another agent may resume it later
11
+
12
+ ## Location
13
+
14
+ - Workstream recovery ledger: `docs/exec-plans/workstreams.md`
15
+ - Active: `docs/exec-plans/active/`
16
+ - Completed: `docs/exec-plans/completed/`
17
+
18
+ ## Minimum Sections
19
+
20
+ - goal
21
+ - scope
22
+ - constraints
23
+ - steps
24
+ - validation
25
+ - quality gate
26
+ - defects to resolve
27
+ - rework required
28
+ - phase continuity
29
+ - durable knowledge to capture
30
+ - completion notes
31
+
32
+ ## Operating Rule
33
+
34
+ Update the active plan during the work. When the work is done, score it, complete any required rework, record phase continuity for resumable work, move it to `completed`, and leave behind any durable facts in the right permanent docs.
35
+
36
+ Before scoring or closing, replace generic starter text with task-specific content. Do not leave placeholders such as "Define in-scope work", "Add the first concrete step", or "Describe how the work will be verified". The default unused durable-knowledge line may remain open, but any real knowledge TODO must be logged, written, and marked complete.
37
+
38
+ ## Closed Loop
39
+
40
+ Use the script, not ad hoc manual edits, for the lifecycle:
41
+
42
+ - `plan-start`: create a new active execution plan
43
+ - `knowledge-log`: append a durable fact that still needs to be written into permanent docs and return its stable id; use `--fact-file` for shell-sensitive facts
44
+ - `knowledge-mark-written`: verify and mark a logged fact as written into its permanent doc; evidence must be exact text already present in the destination doc; prefer `--id <knowledge-id> --evidence-file <file>` for shell-sensitive evidence, and use `--append` only to append the exact fact first
45
+ - `defect-log`: record a bug found by validation, evals, browser testing, or code review; this forces the quality gate to fail and makes the defect the next rework input
46
+ - `defect-resolve`: mark a logged defect fixed with validation or code evidence; re-run validation and `quality-score` before closing
47
+ - `quality-score`: write a scored quality gate into the plan; every dimension must include an evidence note; if it fails, the generated `## Rework Required` section becomes the next implementation input
48
+ - `phase-set`: declare whether phased or resumable work continues, pauses, stops, or completes
49
+ - `workstream-upsert`: update `docs/exec-plans/workstreams.md` so interrupted work can be recovered without chat history
50
+ - `plan-close`: refuse to close cleanly until the quality gate passes, phase continuity is recorded, and the listed knowledge items are marked as written to durable docs
51
+ - `check`: run a local handoff check without requiring target-repo CI
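
The closure rule enforced by `plan-close` can be modeled as a list of blockers. This sketch does not reproduce the script's internals: `PlanState` and its fields are hypothetical, chosen only to mirror the conditions named above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PlanState:
    quality_gate_passed: bool = False
    phase_continuity_recorded: bool = False
    open_defects: list[str] = field(default_factory=list)
    unwritten_knowledge_ids: list[str] = field(default_factory=list)


def close_blockers(plan: PlanState) -> list[str]:
    """Return every reason the plan cannot close cleanly; empty means it can."""
    blockers = []
    if plan.open_defects:
        # A logged defect forces the quality gate to fail until it is resolved.
        blockers.append("open defects: " + ", ".join(plan.open_defects))
    if not plan.quality_gate_passed or plan.open_defects:
        blockers.append("quality gate has not passed")
    if not plan.phase_continuity_recorded:
        blockers.append("phase continuity is not recorded")
    if plan.unwritten_knowledge_ids:
        blockers.append("knowledge not marked written: " + ", ".join(plan.unwritten_knowledge_ids))
    return blockers


if __name__ == "__main__":
    plan = PlanState(quality_gate_passed=True, phase_continuity_recorded=True,
                     open_defects=["D1"], unwritten_knowledge_ids=["K2"])
    for blocker in close_blockers(plan):
        print(blocker)
```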
@@ -0,0 +1,17 @@
1
+ # File Map
2
+
3
+ - `AGENTS.md`: short router, reading order, repo-specific guardrails
4
+ - `ARCHITECTURE.md`: domain boundaries, runtime topology, integration seams
5
+ - `docs/PLANS.md`: plan lifecycle and storage rules
6
+ - `docs/PRODUCT_SENSE.md`: product heuristics and tradeoff rules
7
+ - `docs/QUALITY_SCORE.md`: quality rubric by domain and layer
8
+ - `docs/RELIABILITY.md`: SLOs, failure modes, observability expectations
9
+ - `docs/SECURITY.md`: security constraints, secrets, auth, data handling
10
+ - `docs/DESIGN.md`: design principles and review heuristics
11
+ - `docs/FRONTEND.md`: frontend stack conventions and validation loop
12
+ - `docs/design-docs/`: durable design decisions
13
+ - `docs/product-specs/`: durable product specs
14
+ - `docs/exec-plans/`: active plans, completed plans, and tech debt tracker
15
+ - `docs/sops/`: mechanical procedures for recurring workflows and validation loops
16
+ - `docs/generated/`: generated evidence and facts such as schemas, browser screenshots, DOM snapshots, layout summaries, and smoke outputs; use `evidence-prune` to preview stale unreferenced artifacts before deleting
17
+ - `docs/references/`: external references rewritten or linked for model-friendly discovery
@@ -0,0 +1,35 @@
1
+ # Knowledge Capture
2
+
3
+ Write durable knowledge into the repository whenever one of these is true:
4
+
5
+ - the fact changed your implementation plan
6
+ - the fact would likely be needed by another agent later
7
+ - the fact came from a human answer rather than directly from code
8
+ - the fact explains why a policy, architecture choice, or validation loop exists
9
+ - the fact would be annoying to rediscover from scratch
10
+
11
+ ## Where To Write It
12
+
13
+ - Product behavior or workflow intent: `docs/product-specs/`
14
+ - Design rationale or UX rules: `docs/design-docs/`
15
+ - Runtime validation, incidents, or observability loops: `docs/RELIABILITY.md` or `docs/sops/`
16
+ - Security constraints or review gates: `docs/SECURITY.md`
17
+ - Architecture boundaries or integration seams: `ARCHITECTURE.md`
18
+ - Reusable external material: `docs/references/`
19
+
20
+ ## Minimum Rule
21
+
22
+ If a useful fact would otherwise live only in chat, move it into the repo before closing the task.
23
+
24
+ ## Closed Loop
25
+
26
+ Prefer the script workflow:
27
+
28
+ 1. Log the fact into the active execution plan with `knowledge-log`.
29
+ 2. Write the fact into its permanent destination doc.
30
+ 3. Mark the plan item complete with `knowledge-mark-written --id <knowledge-id> --evidence "<verbatim text present in durable doc>"`.
31
+ 4. Close the plan with `plan-close`.
32
+
33
+ `knowledge-log` returns a stable id. Prefer id-based closure so permanent docs can use concise, natural wording rather than duplicating the exact plan fact.
34
+
35
+ `knowledge-mark-written` verifies that the destination file contains either the provided evidence text or the exact fact. Evidence must be copied from the destination doc; a summary such as "the doc now states this rule" is rejected unless that exact sentence is in the doc. Use `--append` only when the exact fact should be appended to the destination doc by the tool.
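
The verbatim-evidence rule amounts to a substring check against the destination doc. The sketch below illustrates that idea only; whether the real tool normalizes whitespace or case is not stated here, so the exact-match behavior and the `evidence_is_verbatim` name are assumptions.

```python
def evidence_is_verbatim(destination_text: str, evidence: str) -> bool:
    """True only when the non-empty evidence string appears exactly in the doc text."""
    evidence = evidence.strip()
    return bool(evidence) and evidence in destination_text


if __name__ == "__main__":
    # Hypothetical destination doc content.
    doc = "Sessions expire after 30 minutes of inactivity.\n"
    print(evidence_is_verbatim(doc, "expire after 30 minutes"))
    print(evidence_is_verbatim(doc, "the doc now states this rule"))
```

A paraphrase fails the check even when it is accurate, which is why copying a short phrase from the destination doc is the reliable path.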
@@ -0,0 +1,29 @@
1
+ # Question Catalog
2
+
3
+ Use these prompts only when the repo analysis cannot answer them.
4
+
5
+ ## Product
6
+
7
+ - What core user outcome does this repository serve?
8
+ - Which flows matter enough to deserve explicit product specs first?
9
+ - Which non-goals should the harness make visible?
10
+
11
+ ## Reliability
12
+
13
+ - What failure is unacceptable in production?
14
+ - What recovery time or uptime expectation matters most?
15
+ - Which runtime environments must be validated locally before merge?
16
+
17
+ ## Security
18
+
19
+ - Does the repo handle credentials, customer data, regulated data, or privileged actions?
20
+ - Are there required review gates for authentication, authorization, or secrets handling?
21
+
22
+ ## Frontend
23
+
24
+ - Is the product expected to have a polished user-facing interface, an internal tool UI, or no frontend?
25
+ - Which browsers, devices, or accessibility expectations are non-negotiable?
26
+
27
+ ## References
28
+
29
+ - Which external docs are worth copying into `docs/references/` because the team uses them repeatedly?
@@ -0,0 +1,12 @@
1
+ # SOP Index
2
+
3
+ Choose an SOP whenever the task touches one of these areas:
4
+
5
+ - architecture or layering changes: `docs/sops/layered-domain-architecture-setup.md`
6
+ - missing durable repository knowledge: `docs/sops/encode-unseen-knowledge.md`
7
+ - runtime debugging or observability setup: `docs/sops/local-observability-feedback-loop.md`
8
+ - user interface work: `docs/sops/chrome-devtools-ui-validation-loop.md`
9
+ - product correctness, frontend layout, or bug-discovery evals: `docs/sops/evidence-first-eval-loop.md`
10
+ - backend behavior, architecture boundaries, data/state, security, or performance issue triage: start from the Issue Workflows in `AGENTS.md`, then follow the domain docs listed there
11
+
12
+ If no SOP exists for a recurring workflow, create one in `docs/sops/` as part of the task.
@@ -0,0 +1,13 @@
1
+ # Template Policy
2
+
3
+ Every generated file starts with a managed marker:
4
+
5
+ `<!-- harness-engine:managed -->`
6
+
7
+ Init behavior:
8
+
9
+ - `init`: create missing files for new repositories; when an existing managed harness is detected, refresh managed files and create missing files while preserving unmanaged files
10
+
11
+ Use `init` as the normal workspace command so creation and reconciliation share one path. Use `--force` only when the human explicitly accepts overwriting.
12
+
13
+ If a file exists without the managed marker, treat it as user-owned unless the human explicitly asks to replace it.
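
The ownership rule can be sketched as a decision per file. This is an illustration under stated assumptions rather than the `init` implementation: ignoring leading whitespace before the marker, and the `init_action` helper with its `"create"`/`"refresh"`/`"preserve"` results, are hypothetical.

```python
from typing import Optional

MANAGED_MARKER = "<!-- harness-engine:managed -->"


def is_managed(text: str) -> bool:
    """A file is managed when its content starts with the managed marker."""
    return text.lstrip().startswith(MANAGED_MARKER)


def init_action(existing_text: Optional[str], force: bool = False) -> str:
    """Decide what init would do with one file; None means the file is missing."""
    if existing_text is None:
        return "create"
    if is_managed(existing_text) or force:
        return "refresh"
    return "preserve"  # user-owned: leave it alone unless the human accepts overwriting


if __name__ == "__main__":
    print(init_action(None))
    print(init_action(MANAGED_MARKER + "\n# Plans\n"))
    print(init_action("# My own notes\n"))
```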
@@ -0,0 +1,55 @@
1
+ # Workflow
2
+
3
+ Use this skill in two passes.
4
+
5
+ ## Pass 1: Analyze and Confirm
6
+
7
+ Run `analyze` before editing repository docs.
8
+
9
+ Ask the human only about facts that cannot be derived safely from the repo, especially:
10
+
11
+ - product domain and top-level outcomes
12
+ - intended users or operators
13
+ - production reliability expectations
14
+ - security or compliance constraints
15
+ - frontend experience bar
16
+ - canonical external references worth pinning inside `docs/references/`
17
+
18
+ Do not ask for facts that can be inferred from source layout, dependency manifests, or existing docs.
19
+
20
+ Also inspect the analysis for:
21
+
22
+ - missing durable knowledge that should be written during the task
23
+ - missing execution-plan state
24
+ - which SOPs should be referenced in the generated router docs
25
+
26
+ ## Pass 2: Init
27
+
28
+ Run `sample-answers`, fill the answers, then run `init`.
29
+
30
+ Use `init` for both first-time adoption and managed-harness reconciliation. It creates a new harness when none exists; when an existing managed harness is detected, it refreshes managed files and backfills newly introduced managed files. Unmanaged user files are preserved unless `--force` is explicitly used.
31
+
32
+ After the script runs, read the generated docs once and tighten weak generic phrases before handing off.
33
+
34
+ ## Ongoing Use
35
+
36
+ After the scaffold exists:
37
+
38
+ - read `docs/exec-plans/workstreams.md` before resuming interrupted or long-running work
39
+ - create an execution plan before multi-step work
40
+ - use `plan-start` instead of creating plan files manually when possible
41
+ - log durable facts during execution instead of waiting until the end
42
+ - follow the matching SOP for architecture, UI, observability, or knowledge capture work
43
+ - route product, frontend, backend, architecture, data/state, security, performance, and reliability questions through the Issue Workflows in `AGENTS.md`, even when the user did not invoke the harness skill by name
44
+ - encode durable knowledge back into the repository before closing the task
45
+ - mark logged knowledge items as written after updating the permanent docs; the `knowledge-mark-written` evidence must be exact text already present in the destination doc, not a paraphrase
46
+ - log every defect found by tests, evals, browser validation, or code review with `defect-log`
47
+ - resolve logged defects only after fixing the implementation and citing passing validation with `defect-resolve`
48
+ - run `quality-score` after implementation and validation, with evidence notes for every dimension
49
+ - if `quality-score` fails, implement the `## Rework Required` items and score again
50
+ - use `phase-set` and `workstream-upsert` when a plan belongs to phased or resumable work
51
+ - use `plan-close` to verify no durable knowledge is left stranded in the active plan
52
+ - before `plan-close`, replace generic plan placeholders with task-specific scope, constraints, steps, validation, and completion notes; delete unused ad hoc durable-knowledge TODOs
53
+ - run `.codex/skills/harness-engine/scripts/manage_harness.py check --repo <target-repo>` before handoff
54
+ - preview stale generated evidence with `evidence-prune` when `docs/generated/` contains old screenshots, DOM dumps, layout summaries, or smoke outputs; review the dry-run output before using `--apply`
55
+ - do not add CI to the target repository unless the human explicitly asks for it
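
The pre-handoff `check` step can be wrapped in a short helper when it needs to run from another script. The sketch assumes the script is invoked through the current Python interpreter and that a zero exit code means the check passed; neither is confirmed here, and `check_command` and `handoff_allowed` are hypothetical names.

```python
from __future__ import annotations

import subprocess
import sys

SCRIPT = ".codex/skills/harness-engine/scripts/manage_harness.py"


def check_command(repo: str) -> list[str]:
    """Build the argv for the local handoff check against a target repo."""
    return [sys.executable, SCRIPT, "check", "--repo", repo]


def handoff_allowed(repo: str) -> bool:
    """Run the check; assumes exit code 0 means the handoff check passed."""
    result = subprocess.run(check_command(repo), capture_output=True, text=True)
    if result.returncode != 0:
        # Surface whatever the script reported so the failure is actionable.
        print(result.stdout or result.stderr)
    return result.returncode == 0


if __name__ == "__main__":
    print(" ".join(check_command("path/to/target-repo")))
```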