@hallucination-studio/harness-engine 1.0.0-beta.9.bb2cd30 → 1.0.0-nightly.20260613.2
- package/LICENSE +21 -0
- package/README.md +181 -40
- package/bin/install.js +29 -17
- package/package.json +5 -5
- package/skills/harness-engine/SKILL.md +97 -0
- package/skills/harness-engine/agents/openai.yaml +4 -0
- package/skills/harness-engine/evals/cases.json +94 -0
- package/skills/harness-engine/evals/harness_engine_evals/__init__.py +1 -0
- package/skills/harness-engine/evals/harness_engine_evals/cases_frontend.py +211 -0
- package/skills/harness-engine/evals/harness_engine_evals/cases_lifecycle.py +1616 -0
- package/skills/harness-engine/evals/harness_engine_evals/helpers.py +155 -0
- package/skills/harness-engine/evals/harness_engine_evals/registry.py +55 -0
- package/skills/harness-engine/evals/harness_engine_evals/report.py +36 -0
- package/skills/harness-engine/evals/harness_engine_evals/runner.py +53 -0
- package/skills/harness-engine/evals/run_evals.py +14 -0
- package/skills/{harness-repo-bootstrap → harness-engine}/references/evaluation-loop.md +8 -4
- package/skills/harness-engine/references/evidence-first-evals.md +187 -0
- package/skills/harness-engine/references/exec-plans.md +59 -0
- package/skills/{harness-repo-bootstrap → harness-engine}/references/file-map.md +3 -3
- package/skills/{harness-repo-bootstrap → harness-engine}/references/knowledge-capture.md +2 -2
- package/skills/{harness-repo-bootstrap → harness-engine}/references/sop-index.md +3 -0
- package/skills/harness-engine/references/template-policy.md +17 -0
- package/skills/harness-engine/references/workflow.md +62 -0
- package/skills/harness-engine/scripts/harness_engine/__init__.py +1 -0
- package/skills/harness-engine/scripts/harness_engine/analysis.py +240 -0
- package/skills/harness-engine/scripts/harness_engine/checks.py +287 -0
- package/skills/harness-engine/scripts/harness_engine/cli.py +656 -0
- package/skills/harness-engine/scripts/harness_engine/common.py +977 -0
- package/skills/harness-engine/scripts/harness_engine/continuation.py +520 -0
- package/skills/harness-engine/scripts/harness_engine/git_ops.py +88 -0
- package/skills/harness-engine/scripts/harness_engine/knowledge.py +329 -0
- package/skills/harness-engine/scripts/harness_engine/plans.py +630 -0
- package/skills/harness-engine/scripts/harness_engine/templates.py +124 -0
- package/skills/harness-engine/scripts/manage_harness.py +14 -0
- package/skills/harness-repo-bootstrap/SKILL.md +0 -79
- package/skills/harness-repo-bootstrap/agents/openai.yaml +0 -4
- package/skills/harness-repo-bootstrap/evals/cases.json +0 -26
- package/skills/harness-repo-bootstrap/evals/run_evals.py +0 -788
- package/skills/harness-repo-bootstrap/references/exec-plans.md +0 -49
- package/skills/harness-repo-bootstrap/references/template-policy.md +0 -12
- package/skills/harness-repo-bootstrap/references/workflow.md +0 -53
- package/skills/harness-repo-bootstrap/scripts/manage_harness.py +0 -2175
- package/skills/{harness-repo-bootstrap → harness-engine}/assets/repo-template/.keep +0 -0
- package/skills/{harness-repo-bootstrap → harness-engine}/assets/sops/.keep +0 -0
- package/skills/{harness-repo-bootstrap → harness-engine}/references/question-catalog.md +0 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Hallucination Studio
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
package/README.md
CHANGED
@@ -1,26 +1,38 @@
 # Harness Engine
 
+[](https://www.npmjs.com/package/@hallucination-studio/harness-engine)
+[](https://www.npmjs.com/package/@hallucination-studio/harness-engine)
+[](https://github.com/hallucination-studio/harness-engine/actions/workflows/ci.yml)
+[](https://github.com/hallucination-studio/harness-engine/actions/workflows/publish-release.yml)
+[](https://github.com/hallucination-studio/harness-engine/actions/workflows/publish-nightly.yml)
+[](https://github.com/hallucination-studio/harness-engine/blob/main/LICENSE)
+
 Harness Engine packages a Codex skill that bootstraps an agent-first repository harness.
 It turns the repository-shaping ideas from OpenAI's
 ["Harness engineering: leveraging Codex in an agent-first world"](https://openai.com/index/harness-engineering/)
 into an installable `npx` workflow.
 
 The package does not install a harness into this repository. This repository builds and publishes
-the installer. Users install the bundled `harness-
+the installer. Users install the bundled `harness-engine` skill into their own project or
 global Codex skill directory, then ask Codex to use that skill to analyze the target repository,
 ask for missing high-impact facts, create the harness files, and keep future work closed-loop.
 
 ## What This Project Does
 
-- Installs the `harness-
+- Installs the `harness-engine` Codex skill locally, globally, or into a custom skills directory.
 - Provides a repository analyzer that detects language, package manager, frontend signals, existing harness files, missing execution-plan state, and missing SOPs.
-- Generates a short routing-style `AGENTS.md` plus durable system-of-record docs such as `ARCHITECTURE.md`, `docs/RELIABILITY.md`, `docs/SECURITY.md`,
--
+- Generates a short routing-style `AGENTS.md` plus durable system-of-record docs such as `ARCHITECTURE.md`, `docs/RELIABILITY.md`, `docs/SECURITY.md`, and `docs/QUALITY_SCORE.md`.
+- Generates `docs/FRONTEND.md`, `docs/DESIGN.md`, and `docs/design-docs/` only when a frontend surface is detected.
+- Creates version-controlled execution-plan folders for active and completed plans.
 - Adds SOPs for architecture setup, knowledge capture, local observability, and UI validation.
+- Reconciles managed harnesses through the same `init` flow, refreshing managed files and backfilling newly introduced managed files while preserving unmanaged docs.
+- Provides `clean` to remove local skill installs and generated evidence, add `.gitignore` entries, and untrack already committed runtime artifacts so a follow-up commit deletes them from the remote.
 - Enforces a local harness check without assuming the user's project has CI.
+- Previews and optionally removes stale unreferenced generated evidence under `docs/generated/`.
 - Supports durable knowledge closure with stable knowledge IDs and evidence text, so permanent docs can use natural wording instead of duplicated checklist strings.
-- Enforces
+- Enforces structured execution-plan state: `acceptance-set` creates a pre-implementation Acceptance Contract, `quality-score` records post-implementation evidence against that contract, and stale or failed scores block `plan-close`.
 - Tracks resumable workstreams so interrupted features, refactors, reliability work, and cleanup efforts can be recovered from repo state instead of chat history.
+- For frontend projects, asks for the desired visual style and initializes a repository-owned visual specification based on the local DESIGN.md format pattern: YAML design tokens plus markdown rationale.
 
 ## Why It Exists
 
@@ -74,55 +86,175 @@ Show where the skill would be installed:
 npx @hallucination-studio/harness-engine where --local
 ```
 
+## Frontend Design Docs
+
+Harness Engine has no external design runtime dependency and never calls an external design skill
+during `init`. When a target repository has no frontend, it does not generate `docs/FRONTEND.md`,
+`docs/DESIGN.md`, or `docs/design-docs/`.
+
+When a frontend is detected, Harness Engine creates:
+
+- `docs/FRONTEND.md`: project positioning, requested style direction, existing frontend code signals, frontend scope, stack notes, validation loop, and the read order for UI work.
+- `docs/DESIGN.md`: a project-owned unified visual specification using YAML design tokens plus markdown rationale, seeded from the human-confirmed style direction and existing frontend code signals. It defines semantic colors, a unified typography scale, spacing/radius tokens, component states, and rules for mapping those tokens into the project's shared style layer.
+- `docs/design-docs/`: durable design decisions and style-system notes.
+
+The templates are informed by the local reference checkout at `/Users/murphy/code/github/design.md`
+for document shape only. The target project owns the content and should replace starter tokens and
+prose with its concrete product style before substantial UI work.
+
+## Update An Installed Skill Package
+
+The `npx` installer installs or replaces the `harness-engine` Codex skill.
+To update an already installed skill, rerun `install` with `--force` in the same install location.
+
+Replace the local skill install:
+
+```bash
+npx @hallucination-studio/harness-engine install --local --force
+```
+
+Replace the global skill install:
+
+```bash
+npx @hallucination-studio/harness-engine install --global --force
+```
+
+Replace a custom skill install:
+
+```bash
+npx @hallucination-studio/harness-engine install --path /path/to/skills --force
+```
+
+After the skill package is installed, the target repository workflow happens inside Codex. In the
+target workspace, invoke the skill:
+
+```text
+$harness-engine
+```
+
+The skill should analyze the workspace and run the single workspace entrypoint:
+
+- If the harness is not installed in that repository, `manage_harness.py init` creates it.
+- If a managed harness already exists, `manage_harness.py init` reconciles it by refreshing managed files and backfilling newly introduced managed files.
+- Unmanaged user files are preserved unless `--force` is explicitly used.
+
+Codex runs the underlying manager commands. Users should not need to call the Python script
+directly during normal work.
+
 ## Use The Skill In A Target Repo
 
 After installing, open Codex in the target repository and invoke:
 
 ```text
-$harness-
+$harness-engine
 ```
 
 The intended workflow is:
 
 1. Analyze the target repository.
 2. Ask the human only for unresolved, high-impact facts.
-3. Initialize or
-4. Create execution plans for
-5.
-6.
-7.
-8.
-9.
-10.
-11.
-12.
+3. Initialize or reconcile the harness files.
+4. Create or reuse execution plans for repository-mutating work, including code, docs, configuration, tests, dependencies, build/release scripts, generated templates, runtime behavior, migrations, cleanup, and review fixes.
+5. Define the Acceptance Contract before implementation with product, UX, architecture, reliability, and security criteria.
+6. Log durable knowledge into active plans.
+7. Write the durable facts into permanent docs.
+8. Mark knowledge as written using ID plus evidence text.
+9. Score the finished work against the Acceptance Contract across product, UX/operator clarity, architecture, reliability, and security.
+10. If the Quality Result fails, implement the generated `## Rework Required` items and score again.
+11. Before closing, record a `Continuation Decision` for the plan.
+12. Close the execution plan only after the Quality Result passes against the current contract fingerprint, the continuation decision is recorded, and durable docs are updated.
+13. Run the local harness check before handoff.
+14. Periodically run `evidence-prune` to preview stale unreferenced generated evidence, and apply it only after reviewing the candidate list.
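The preview-then-apply pattern behind `evidence-prune` in step 14 can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the helper names `stale_unreferenced` and `prune`, the use of file modification time as the age signal, and the explicit `referenced` set are all assumptions, since the README does not say how the real command decides what is stale or referenced.

```python
import time
from pathlib import Path


def stale_unreferenced(generated_dir, referenced, older_than_days, now=None):
    """Hypothetical preview step: list candidate files without deleting anything.

    A file is a candidate when it is older than the cutoff (by mtime, an
    assumption) and its path relative to generated_dir is not in `referenced`.
    """
    now = time.time() if now is None else now
    cutoff = now - older_than_days * 86400
    root = Path(generated_dir)
    candidates = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if rel in referenced:
            continue
        if path.stat().st_mtime < cutoff:
            candidates.append(rel)
    return candidates


def prune(generated_dir, candidates, apply=False):
    """Hypothetical apply step: delete only when apply=True, mirroring --apply."""
    if apply:
        for rel in candidates:
            (Path(generated_dir) / rel).unlink()
    return candidates
```

Keeping the listing and the deletion in separate functions is what makes the review step possible: the candidate list can be inspected before anything is removed.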
+
+## User Continuation UX
+
+Users should express the desired continuation state in natural language. Codex then runs the
+required harness commands and repairs any blocked state before handoff.
+
+Useful phrases:
+
+- "This one is done, with no follow-up" / "mark this complete"
+- "This one should continue into the next phase" / "continue this as a follow-up workstream"
+- "Pause for now and resume once the API is finalized" / "pause until the API contract is approved"
+- "Stop this direction and record the reason" / "stop this work and record why"
+- "Put this in tech debt and keep it out of the current workstream" / "defer this to tech debt"
+
+When the user says a task should continue or pause, Codex records the workstream, next action,
+resume notes, and goal in `docs/exec-plans/workstreams.md`. When the user says it is complete,
+Codex records a complete decision and closes the plan without creating a workstream entry.
+
+## CLI Reference
 
-The installed skill exposes
+The installed skill exposes a manager script for Codex and for advanced debugging:
 
 ```bash
-python3 .codex/skills/harness-
+python3 .codex/skills/harness-engine/scripts/manage_harness.py --help
+```
+
+For frontend or visual-design work, the generated harness uses `docs/FRONTEND.md` to route agents through `docs/DESIGN.md`. `docs/FRONTEND.md` defines which files are controlled by `docs/DESIGN.md`: design notes under `docs/design-docs/`, Tailwind theme files, global CSS variables, component theme modules, Storybook/theme previews, and UI implementation files that consume shared tokens or style rules. Agents should read `docs/FRONTEND.md`, then `docs/DESIGN.md`, then the relevant component, theme, or stylesheet.
+
+These commands are not the primary user interface. They are shown so maintainers can debug or
+inspect what Codex runs:
+
+```bash
+python3 .codex/skills/harness-engine/scripts/manage_harness.py analyze --repo . --output analysis.json
+python3 .codex/skills/harness-engine/scripts/manage_harness.py sample-answers --analysis analysis.json --output answers.json
+python3 .codex/skills/harness-engine/scripts/manage_harness.py init --repo . --answers answers.json
+python3 .codex/skills/harness-engine/scripts/manage_harness.py plan-start --repo . --slug feature-name --goal "Implement the feature"
+python3 .codex/skills/harness-engine/scripts/manage_harness.py acceptance-set --repo . --plan docs/exec-plans/active/2026-06-11-feature-name.md --product "The feature satisfies the named user workflow and expected output." --ux "The user or operator can complete the workflow without ambiguous states." --architecture "The change fits the existing module boundaries and keeps plan state recoverable." --reliability "The validation commands and failure evidence are repeatable from a clean checkout." --security "The change introduces no secrets and preserves sensitive-data handling rules."
+python3 .codex/skills/harness-engine/scripts/manage_harness.py quality-score --repo . --plan docs/exec-plans/active/2026-06-11-feature-name.md --product-correctness 8 --product-note "Product assertions passed" --ux-operator-clarity 8 --ux-note "User workflow evidence passed" --architecture-maintainability 8 --architecture-note "Boundary and maintainability review passed" --reliability-observability 8 --reliability-note "Tests and smoke checks passed" --security-data-handling 8 --security-note "No new sensitive-data paths or secrets"
+python3 .codex/skills/harness-engine/scripts/manage_harness.py continuation-set --repo . --plan docs/exec-plans/active/2026-06-11-feature-name.md --decision complete --closure-reason "Feature is complete with no follow-up."
+python3 .codex/skills/harness-engine/scripts/manage_harness.py continuation-set --repo . --plan docs/exec-plans/active/2026-06-11-feature-name.md --decision continue --workstream feature-name --next-target docs/exec-plans/workstreams.md#feature-name --next-action "Create the next execution plan" --goal "Deliver the feature across follow-up execution plans"
+python3 .codex/skills/harness-engine/scripts/manage_harness.py check --repo .
+python3 .codex/skills/harness-engine/scripts/manage_harness.py evidence-prune --repo . --older-than-days 14
+python3 .codex/skills/harness-engine/scripts/manage_harness.py evidence-prune --repo . --older-than-days 14 --apply
+python3 .codex/skills/harness-engine/scripts/manage_harness.py clean --repo .
+python3 .codex/skills/harness-engine/scripts/manage_harness.py clean --repo . --apply
+```
+
+The quality workflow is intentionally local and repository-owned. It does not require the user's
+project to have CI. Active plans must have a ready Acceptance Contract sidecar so work is
+recoverable before implementation finishes. Completed plans must have a passing Quality Result
+scored against the current Acceptance Contract fingerprint; `plan-close` rejects stale scores,
+open defects, unresolved placeholders, and unresolved durable knowledge. Blocked `plan-close`
+commands return structured JSON with `status: "blocked"`, a stable `reason`, a user-readable
+`message`, and machine-readable `details`.
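As a rough illustration of how a wrapper script might consume a blocked `plan-close` response, the sketch below parses a JSON payload carrying the four fields named above. The helper `summarize_plan_close`, the example `reason` value, and the keys inside `details` are assumptions made for illustration; the README names the fields but does not document their values.

```python
import json


def summarize_plan_close(raw: str) -> str:
    """Return a one-line summary of a plan-close JSON result (hypothetical helper)."""
    result = json.loads(raw)
    if result.get("status") != "blocked":
        return "plan-close not blocked"
    reason = result.get("reason", "unknown")
    message = result.get("message", "")
    return f"blocked [{reason}]: {message}"


# Example payload. The reason string and the details keys are invented for this sketch.
example = json.dumps({
    "status": "blocked",
    "reason": "stale_quality_score",
    "message": "Quality Result was scored against an older contract.",
    "details": {"plan": "docs/exec-plans/active/2026-06-11-feature-name.md"},
})
print(summarize_plan_close(example))
```

Branching on the stable `reason` rather than on the human-readable `message` keeps such a wrapper robust if the wording of messages changes between releases.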
+
+## Version Control Policy
+
+Commit harness docs that carry durable repository knowledge: `AGENTS.md`, `ARCHITECTURE.md`,
+`docs/PLANS.md`, `docs/QUALITY_SCORE.md`, `docs/RELIABILITY.md`, `docs/SECURITY.md`,
+`docs/FRONTEND.md`, `docs/sops/`, `docs/product-specs/`, `docs/design-docs/`,
+`docs/references/`, and execution-plan state.
+
+Execution plans are project state. Commit active plans, completed plans, JSON sidecars, and `docs/exec-plans/workstreams.md` so another agent can recover the work from the repository.
+
+Do not commit local skill installs or generated evidence by default. `clean --apply` adds these directory-level ignores:
+
+```gitignore
+# harness-engine transient files
+.codex/skills/
+docs/generated/
+# end harness-engine transient files
 ```
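One way a marker-delimited ignore block like the one above can be added without duplicating it on repeated runs is sketched below. The function `ensure_ignore_block` and its append-only strategy are assumptions for illustration; only the marker comments and the two directory entries come from the README text.

```python
BEGIN = "# harness-engine transient files"
END = "# end harness-engine transient files"
ENTRIES = [".codex/skills/", "docs/generated/"]


def ensure_ignore_block(text: str) -> str:
    """Append the marker-delimited block unless both markers are already present."""
    if BEGIN in text and END in text:
        return text  # already present: leave the file untouched
    block = "\n".join([BEGIN, *ENTRIES, END]) + "\n"
    if text and not text.endswith("\n"):
        text += "\n"  # keep the block on its own lines
    return text + block


updated = ensure_ignore_block("node_modules/")
print(updated)
```

Because the markers bracket the managed entries, a second call returns the text unchanged, which is the property a repeatable `clean --apply` would need.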
 
-
+If those files were already committed or pushed, run:
 
 ```bash
-python3 .codex/skills/harness-
-python3 .codex/skills/harness-
-
-
-
-
-python3 .codex/skills/harness-repo-bootstrap/scripts/manage_harness.py workstream-upsert --repo . --id feature-name --status active --current-plan docs/exec-plans/active/2026-06-11-feature-name.md --next-action "Create Phase 2 plan"
-python3 .codex/skills/harness-repo-bootstrap/scripts/manage_harness.py check --repo .
+python3 .codex/skills/harness-engine/scripts/manage_harness.py clean --repo .
+python3 .codex/skills/harness-engine/scripts/manage_harness.py clean --repo . --apply
+git status --short
+git diff --cached --stat
+git commit -m "Remove harness runtime artifacts from git"
+git push
 ```
 
-
-project to have CI. `plan-close` refuses to move a plan to `completed` unless `quality-score`
-has passed, and `check` reports active plans whose quality gate is missing or failing.
+`clean --apply` removes local generated evidence, then uses `git rm --cached` to stage removal of tracked local skill installs and generated evidence from git and the remote. It does not remove, ignore, or untrack execution plans, JSON sidecars, or workstreams.
 
-
-
-
+Every plan closes with a `Continuation Decision`: `complete`, `continue`, `pause`, `stop`, or
+`defer`. Only resumable `continue` and `pause` decisions enter `docs/exec-plans/workstreams.md`;
+one-off completed plans do not need workstream entries. Invalid `continue` or `pause` inputs fail before
+writing workstream state, and workstream goals are taken from `--goal` or the plan goal.
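The continuation rules in that paragraph can be sketched as a small validation step that runs before any workstream state is written. The function `plan_workstream_entry`, the set of required fields, and the shape of the returned dictionary are assumptions for illustration; only the five decision names, the `continue`/`pause` rule, and the goal fallback come from the README text.

```python
from typing import Optional

DECISIONS = {"complete", "continue", "pause", "stop", "defer"}
RESUMABLE = {"continue", "pause"}  # only these enter workstreams.md


def plan_workstream_entry(decision: str,
                          workstream: Optional[str] = None,
                          next_action: Optional[str] = None,
                          goal: Optional[str] = None,
                          plan_goal: Optional[str] = None) -> Optional[dict]:
    """Validate first, then return the entry to write, or None when no entry is needed."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision}")
    if decision not in RESUMABLE:
        return None  # complete, stop, and defer create no workstream entry
    resolved_goal = goal or plan_goal  # explicit goal wins, else the plan goal
    required = (("workstream", workstream),
                ("next_action", next_action),
                ("goal", resolved_goal))
    missing = [name for name, value in required if not value]
    if missing:
        # Fail before anything is written (which fields are required is assumed here).
        raise ValueError("missing fields: " + ", ".join(missing))
    return {"id": workstream, "decision": decision,
            "next_action": next_action, "goal": resolved_goal}
```

Returning the entry instead of writing it directly keeps validation and file mutation separate, so an invalid `continue` or `pause` input can never leave a partial workstream record.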
 
 ## Generated Harness Shape
 
@@ -188,6 +320,15 @@ Check npm package contents:
 npm run pack:check
 ```
 
+Before release, run:
+
+```bash
+npm test
+npm run smoke:install
+npm run pack:check
+git diff --check
+```
+
 The publish workflows expect an npm token when trusted publishing is not yet configured:
 
 ```text
@@ -200,15 +341,15 @@ These scores describe the current implementation, not an external guarantee.
 
 | Layer | Score | Notes |
 | --- | ---: | --- |
-| Product fit |
-| Skill workflow design | 9 / 10 | Strong progressive workflow: analyze, confirm,
-| Knowledge, quality, and workstream closure loop |
+| Product fit | 9 / 10 | Clear purpose: install a Codex skill that creates and maintains an agent-first repository harness. Real acceptance against a fresh Go backend plus browser frontend project validated generation and later issue workflows. Broader usage across more project types would still improve confidence. |
+| Skill workflow design | 9.2 / 10 | Strong progressive workflow: analyze, confirm, init/reconcile, plan, capture knowledge, validate, score with evidence notes, rework, record continuity, close. The workflow now explicitly routes repository-mutating feature, bug, refactor, docs, dependency, UI, test, security, performance, and reliability work through the same lifecycle. |
+| Knowledge, quality, and workstream closure loop | 9.3 / 10 | Stable knowledge IDs plus exact destination evidence reduce noisy doc duplication. Execution plans now have JSON sidecars for Acceptance Contracts, Quality Results, defects, and knowledge state; `quality-score` rejects missing evidence notes or missing contracts, defects invalidate stale scores, and workstreams make resumable follow-up work recoverable. |
 | CLI installer | 8 / 10 | Simple local/global/custom install modes, force replacement, and path discovery. It is intentionally minimal and does not manage Codex runtime configuration. |
-| Generated harness docs |
-| Evaluation coverage |
+| Generated harness docs | 8.4 / 10 | Covers architecture, plans, reliability, security, frontend policy, broad task intake, issue workflows, references, generated artifacts, and SOPs. The docs now front-load exact knowledge evidence, per-dimension quality notes, default plan lifecycle, and plan placeholder cleanup, but templates still require Codex to tighten project-specific language after generation. |
+| Evaluation coverage | 9.2 / 10 | `npm test` runs 23 structured eval cases covering empty-repo init, frontend analysis, init reconciliation, clean command behavior, broad task intake, closed-loop plan behavior, continuation decisions, path canonicalization, defect recovery, required quality-score notes, exact knowledge evidence, structured sidecars, acceptance readiness, stale score rejection, generated-evidence cleanup, eval report shape, user-owned doc preservation, and frontend design control. A fully automated Codex child-agent E2E would raise this further. |
 | Release automation | 8 / 10 | Supports stable release, beta on every main commit, nightly, manual dry-run, artifacts, provenance, and token fallback. npm first-publish/trusted-publishing setup still requires external configuration. |
-| User-project safety | 8.
-| Overall |
+| User-project safety | 8.8 / 10 | The skill avoids adding CI to target projects by default, preserves unmanaged files unless forced, and requires evidence-backed closure for defects and durable knowledge. More destructive-change simulation in evals would improve this score. |
+| Overall | 9.1 / 10 | The skill is now strong enough for regular use: self evals pass across the structured suite, real acceptance covered initial scaffold plus frontend and backend issue workflows, and plan lifecycle state is enforced through JSON sidecars. Remaining leverage is automated child-agent E2E coverage. |
 
 ## Reference
 
package/bin/install.js
CHANGED
@@ -5,8 +5,7 @@ const os = require("os");
 const path = require("path");
 
 const PACKAGE_ROOT = path.resolve(__dirname, "..");
-const SKILL_NAME = "harness-
-const SOURCE_SKILL_DIR = path.join(PACKAGE_ROOT, "skills", SKILL_NAME);
+const SKILL_NAME = "harness-engine";
 
 function printHelp() {
   console.log(`harness-engine
@@ -85,32 +84,45 @@ function copyDir(sourceDir, targetDir) {
   for (const entry of fs.readdirSync(sourceDir, { withFileTypes: true })) {
     const sourcePath = path.join(sourceDir, entry.name);
     const targetPath = path.join(targetDir, entry.name);
-
+    const stat = fs.statSync(sourcePath);
+    if (stat.isDirectory()) {
       copyDir(sourcePath, targetPath);
+    } else if (entry.isSymbolicLink()) {
+      const linkTarget = fs.readlinkSync(sourcePath);
+      fs.symlinkSync(linkTarget, targetPath);
     } else {
       fs.copyFileSync(sourcePath, targetPath);
-      const stat = fs.statSync(sourcePath);
       fs.chmodSync(targetPath, stat.mode);
     }
   }
 }
 
-function
-  const
-  if (!fs.existsSync(
-    throw new Error(`Bundled skill not found: ${
+function assertSkillSource() {
+  const sourcePath = path.join(PACKAGE_ROOT, "skills", SKILL_NAME);
+  if (!fs.existsSync(sourcePath)) {
+    throw new Error(`Bundled skill not found: ${sourcePath}`);
   }
+}
 
-
-
-
-    }
-    fs.rmSync(skillTargetDir, { recursive: true, force: true });
+function removeIfExists(targetPath, force, label) {
+  if (!fs.existsSync(targetPath)) {
+    return;
   }
 
+  if (!force) {
+    throw new Error(`${label} already exists at ${targetPath}. Re-run with --force to replace it.`);
+  }
+
+  fs.rmSync(targetPath, { recursive: true, force: true });
+}
+
+function installSkill(destinationDir, force) {
+  assertSkillSource();
   fs.mkdirSync(destinationDir, { recursive: true });
-
-
+  const skillTarget = path.join(destinationDir, SKILL_NAME);
+  removeIfExists(skillTarget, force, "Skill");
+  copyDir(path.join(PACKAGE_ROOT, "skills", SKILL_NAME), skillTarget);
+  return skillTarget;
 }
 
 function main() {
@@ -143,8 +155,8 @@ function main() {
 
   try {
     const installedPath = installSkill(destinationDir, args.force);
-    console.log(`Installed ${SKILL_NAME} to ${installedPath}`);
-    console.log("Invoke it in Codex with $harness-
+    console.log(`Installed ${SKILL_NAME} skill to ${installedPath}`);
+    console.log("Invoke it in Codex with $harness-engine.");
   } catch (error) {
     console.error(`Install failed: ${error.message}`);
     process.exit(1);
package/package.json
CHANGED
@@ -1,7 +1,7 @@
 {
   "name": "@hallucination-studio/harness-engine",
-  "version": "1.0.0-
-  "description": "Install
+  "version": "1.0.0-nightly.20260613.2",
+  "description": "Install the harness-engine Codex skill for initializing and reconciling advanced repository harness docs.",
   "repository": {
     "type": "git",
     "url": "git+https://github.com/hallucination-studio/harness-engine.git"
@@ -13,7 +13,7 @@
     "access": "public"
   },
   "scripts": {
-    "test": "python3 skills/harness-
+    "test": "python3 skills/harness-engine/evals/run_evals.py",
     "smoke:install": "node scripts/smoke_install.js",
     "pack:check": "npm pack --dry-run"
   },
@@ -23,9 +23,9 @@
     "skills/**/agents/**",
     "skills/**/assets/**",
     "skills/**/evals/*.json",
-    "skills/**/evals
+    "skills/**/evals/**/*.py",
     "skills/**/references/**",
-    "skills/**/scripts
+    "skills/**/scripts/**/*.py"
   ],
   "license": "MIT"
 }
package/skills/harness-engine/SKILL.md
ADDED
@@ -0,0 +1,97 @@
+ ---
+ name: harness-engine
+ description: Initialize, refresh, and operate an advanced harness-engineering repository lifecycle for Codex-driven projects. Use when Codex needs to create or reconcile harness docs, or when work inside a harness-managed repository will change code, docs, configuration, tests, dependencies, build/release scripts, generated templates, runtime behavior, migrations, cleanup policy, or other durable repository state.
+ ---
+
+ # Harness Engine
+
+ Use the packaged script yourself to inspect the target repository before editing files. Do not ask the user to run the harness Python commands during normal work. Use the generated analysis to decide what to ask the human, what durable knowledge is missing from the repo, and which execution-plan and SOP files must be created or reconciled.
+
+ In a harness-managed repository, default every repository-mutating request into the harness lifecycle. Repository-mutating work includes code, docs, configuration, tests, dependencies, build/release scripts, generated templates, runtime behavior, migrations, cleanup, and fixes from review or user feedback. The only no-plan exceptions are pure question answering, read-only investigation, showing command output, and status reporting with no file changes. If an investigation turns into editing files, enter the lifecycle before editing.
+
+ The user-facing interface is intent, not CLI. If the user says a task is complete, should continue, should pause, should stop, or should become follow-up debt, translate that into the appropriate manager command yourself and report the outcome. Only show raw commands when the user explicitly asks for implementation details or debugging help.
+
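Reviewer sketch (not part of the diffed file): the "intent, not CLI" paragraph can be illustrated by mapping a user's stated intent onto the `--decision` values that workflow step 17 lists for `continuation-set`. The intent labels, the function name, and the pairing of "follow-up debt" with `defer` are assumptions made for illustration; only the subcommand, the `--repo`, `--plan`, and `--decision` flags, and the decision values appear in the skill text.

```python
# Hypothetical intent labels; only the --decision values come from the skill text.
INTENT_TO_DECISION = {
    "complete": "complete",
    "continue": "continue",
    "pause": "pause",
    "stop": "stop",
    "follow-up": "defer",  # assumption: "follow-up debt" corresponds to defer
}


def continuation_command(intent, repo, plan, script="scripts/manage_harness.py"):
    """Build (but do not run) a continuation-set argv list for a user intent."""
    decision = INTENT_TO_DECISION[intent]  # raises KeyError for unknown intents
    return [
        "python3", script, "continuation-set",
        "--repo", repo,
        "--plan", plan,
        "--decision", decision,
    ]
```

The sketch leaves out the optional fields (`--workstream`, `--next-target`, `--next-action`, and so on); per step 17, `continue` and `pause` only update `workstreams.md` once their required fields validate, so a real invocation would likely need them.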
+ ## Workflow
+
+ 1. Run `python3 scripts/manage_harness.py analyze --repo <target-repo> --output <analysis.json>`.
+ 2. Read `analysis.json`.
+ 3. Ask the human only the unresolved, high-impact questions from `human_confirmations`.
+ 4. During initialization, create frontend design docs only when the analysis detects a frontend surface. Frontend repos get `docs/FRONTEND.md`, `docs/DESIGN.md`, and `docs/design-docs/`; backend-only repos do not. Ask the human for the desired visual style direction and use existing frontend style files as evidence. The generated `docs/DESIGN.md` is a project-owned visual specification shaped like DESIGN.md: YAML tokens plus markdown rationale. Do not call external design-generation skills or packages during init.
+ 5. Run `python3 scripts/manage_harness.py sample-answers --analysis <analysis.json> --output <answers.json>`.
+ 6. Fill the placeholders in `answers.json` from the repository and the human's confirmed answers.
+ 7. Run `python3 scripts/manage_harness.py init --repo <target-repo> --answers <answers.json>`. This is the single workspace entrypoint: it creates a new harness when none exists, and reconciles a managed or partial harness when managed harness files are already present. Reconcile refreshes managed files, backfills newly introduced managed files, and preserves unmanaged user files. Pass `--force` only with explicit user approval.
+ 8. For any repository-mutating task, run `python3 scripts/manage_harness.py plan-start --repo <target-repo> --slug <task-name> --goal "<goal>"` unless an active plan already covers the exact work. Small changes may use a lightweight plan, but they still require acceptance, validation, quality scoring, plan close, and check.
+ 9. Before implementation, run `python3 scripts/manage_harness.py acceptance-set --repo <target-repo> --plan <plan-file> --product "<product criterion>" --ux "<UX criterion>" --architecture "<architecture criterion>" --reliability "<reliability criterion>" --security "<security criterion>"`. Criteria must be concrete to the task; generic templates are rejected.
+ 10. If you learn durable facts during the work, run `python3 scripts/manage_harness.py knowledge-log --repo <target-repo> --plan <plan-file> --fact "<fact>" --destination <durable-doc>` and keep the returned `id`. Use `--fact-file <file>` when the fact contains shell-sensitive characters.
+ 11. Before closing the task, write those facts into their durable docs.
+ 12. Run `python3 scripts/manage_harness.py knowledge-mark-written --repo <target-repo> --plan <plan-file> --id <knowledge-id> --evidence "<verbatim text already in durable doc>"`; prefer `--evidence-file <file>` when evidence contains backticks, globs, quotes, pipes, or other shell-sensitive characters. Evidence must be copied from the destination doc, not summarized. Use `--append` only when the exact fact should be appended mechanically.
+ 13. If validation, evals, browser checks, or code review reveal a bug, immediately run `python3 scripts/manage_harness.py defect-log --repo <target-repo> --plan <plan-file> --severity <P0|P1|P2|P3> --summary "<bug>" --evidence "<failing check>"`. This invalidates any existing quality result and makes the defect the next rework input.
+ 14. Fix logged defects, then run `python3 scripts/manage_harness.py defect-resolve --repo <target-repo> --plan <plan-file> --id <bug-id> --fix-evidence "<passing check or code evidence>"`.
+ 15. Score the finished work with `python3 scripts/manage_harness.py quality-score --repo <target-repo> --plan <plan-file> --product-correctness <0-10> --product-note "<evidence>" --ux-operator-clarity <0-10> --ux-note "<evidence>" --architecture-maintainability <0-10> --architecture-note "<evidence>" --reliability-observability <0-10> --reliability-note "<evidence>" --security-data-handling <0-10> --security-note "<evidence>"`. Every dimension needs an evidence note tied to the ready Acceptance Contract.
+ 16. If `quality-score` fails, treat `## Rework Required` in the plan as the next implementation input, fix the work, then run `quality-score` again.
+ 17. Before closing, run `python3 scripts/manage_harness.py continuation-set --repo <target-repo> --plan <plan-file> --decision <complete|continue|pause|stop|defer>`. Use `--workstream`, `--next-target`, `--next-action`, `--closure-reason`, `--resume-notes`, and `--goal` as needed; `continue` and `pause` update `workstreams.md` automatically only after required fields validate.
+ 18. Before closing, replace generic plan placeholders with task-specific scope, constraints, steps, validation, and completion notes; leave no open durable-knowledge placeholder except the default unused line.
+ 19. Close the plan with `python3 scripts/manage_harness.py plan-close --repo <target-repo> --plan <plan-file> --summary "<summary>"`.
+ 20. Before handoff, run `python3 .codex/skills/harness-engine/scripts/manage_harness.py check --repo <target-repo>` from an installed target repository.
+ 21. To review stale generated evidence, run `python3 scripts/manage_harness.py evidence-prune --repo <target-repo>` first; it is dry-run by default. Add `--apply` only after checking the candidate list.
+ 22. To clean transient harness runtime files or remove already committed runtime files from the remote, run `python3 scripts/manage_harness.py clean --repo <target-repo>` first; it is dry-run by default. Add `--apply` to clean local runtime state, update `.gitignore`, and stage `git rm --cached` removals, then commit and push. Clean is limited to local skill installs and generated evidence; execution plans, sidecars, and workstreams are durable project state.
+ 23. After changing this skill, run `python3 evals/run_evals.py` and iterate until it passes.
+
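Reviewer sketch (not part of the diffed file): workflow steps 1, 5, and 7 chain through two JSON files, which may be easier to see as data. The helper below only builds the three argument lists and does not execute anything. The function name, the `workdir` parameter, and the choice to place `analysis.json` and `answers.json` in one directory are assumptions for illustration; the subcommands and flags are the ones quoted in the steps.

```python
import shlex
from pathlib import Path


def bootstrap_commands(repo: Path, workdir: Path,
                       script: str = "scripts/manage_harness.py") -> list[list[str]]:
    """Build argv lists for workflow steps 1, 5, and 7 (analyze, sample-answers, init)."""
    analysis = workdir / "analysis.json"  # written by step 1, read by step 5
    answers = workdir / "answers.json"    # written by step 5, filled in step 6, read by step 7
    base = ["python3", script]
    return [
        base + ["analyze", "--repo", str(repo), "--output", str(analysis)],
        base + ["sample-answers", "--analysis", str(analysis), "--output", str(answers)],
        base + ["init", "--repo", str(repo), "--answers", str(answers)],
    ]


if __name__ == "__main__":
    # Print the commands in shell-quoted form instead of running them.
    for argv in bootstrap_commands(Path("target-repo"), Path("work")):
        print(shlex.join(argv))
```

The commands are deliberately not run back to back: steps 2 to 4 and step 6 (reading the analysis, asking the human, filling the placeholders) sit between them, and `--force` is omitted because the skill reserves it for explicit user approval.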
+ ## Reading Order
+
+ - Read [references/workflow.md](references/workflow.md) first for the operating model and question policy.
+ - Read [references/file-map.md](references/file-map.md) when deciding which generated file to edit.
+ - Read [references/question-catalog.md](references/question-catalog.md) when the analysis surfaces ambiguous product, security, reliability, or frontend facts.
+ - Read [references/knowledge-capture.md](references/knowledge-capture.md) when you discover facts that should survive chat history.
+ - Read [references/exec-plans.md](references/exec-plans.md) before planning or updating any repository-mutating work.
+ - Read [references/sop-index.md](references/sop-index.md) to choose the right SOP for architecture, UI validation, observability, or knowledge capture work.
+ - Read [references/template-policy.md](references/template-policy.md) before overwriting existing files.
+ - Read [references/evaluation-loop.md](references/evaluation-loop.md) before changing the skill, templates, scripts, or policy references.
+ - Read [references/evidence-first-evals.md](references/evidence-first-evals.md) before designing evals for product correctness, frontend validation, or bug-discovery coverage.
+ - Read `docs/FRONTEND.md` and `docs/DESIGN.md` when they exist for frontend, UI, product design, visual design, canvas, or interface polish work.
+
+ ## Command Rules
+
+ - Prefer `analyze` before `init`.
+ - Prefer the draft, test, evaluate, iterate loop for changes to this skill.
+ - Use `init` as the workspace entrypoint for both creation and reconciliation. It refreshes managed harness files when an existing managed harness is detected and preserves unmanaged user files. Use `--force` only when the human accepts overwriting.
+ - Do not overwrite existing files unless the human asked for it or you pass `--force`.
+ - Treat the generated files as starting points. After generation, tighten them with repository-specific details instead of leaving placeholders behind.
+ - Before plan close, replace or remove task placeholders such as "Define in-scope work", "Add the first concrete step", "Describe how the work will be verified", and any ad hoc durable-knowledge TODOs.
+ - Treat `docs/exec-plans/` as required durable state for repository-mutating work, not optional notes.
+ - Read `docs/exec-plans/workstreams.md` before resuming interrupted feature, refactor, reliability, security, frontend, or cleanup work.
+ - Treat `docs/sops/` as mechanical operating procedures, not background reading.
+ - When you answer a question using facts that are not yet in the repo but should be reusable, write them into a durable doc before finishing.
+ - Prefer `knowledge-mark-written --id ... --evidence-file ...` so durable docs can use natural wording without shell quoting failures or duplicated exact fact strings.
+ - The knowledge evidence text must exist verbatim in the destination doc; if it is only a paraphrase, write the durable doc first or use a file containing exact destination text.
+ - Use `defect-log` for every bug found by tests, evals, browser validation, or code review; unresolved defects must block handoff.
+ - Use `defect-resolve` only after the implementation is fixed and you can cite passing validation or code evidence.
+ - Use `acceptance-set` before implementation and `quality-score` before `plan-close`; include `--product-note`, `--ux-note`, `--architecture-note`, `--reliability-note`, and `--security-note`; failed or stale scores must drive rework, not handoff.
+ - Use `continuation-set` before every `plan-close`; choose `complete` for one-off plans, and use `continue` or `pause` for resumable multi-plan work. Invalid continuation input must fail before writing a half-valid workstream.
+ - Use `plan-close` as the final guardrail so plan state, quality score, and durable docs stay synchronized. When blocked, it returns JSON with `status`, `reason`, `message`, and `details`; use that output as the next repair input.
+ - Use `check` as the local handoff guardrail for user repositories. Active plans require ready Acceptance Contracts; completed plans require passing Quality Results scored against the current contract fingerprint.
+ - Use `evidence-prune` as a cleanup preview for old unreferenced files under `docs/generated/`; it never deletes unless `--apply` is present.
+ - Use `clean` when `.codex/skills/` or `docs/generated/` files need cleanup or were already committed. It never changes files or the git index unless `--apply` is present, and it must not remove execution plans, sidecars, or workstreams.
+ - Run `python3 evals/run_evals.py` after skill changes, read the structured report, and treat per-case failures as iteration input.
+ - Do not add CI to user repositories unless the human explicitly asks for it.
+
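Reviewer sketch (not part of the diffed file): the `plan-close` rule says a blocked close returns JSON with `status`, `reason`, `message`, and `details`, and that this output should drive the next repair. A caller might read that payload along the following lines. The dataclass, the function name, and the strictness about missing keys are assumptions; the skill text names the four keys but does not document their value types or allowed values.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class PlanCloseResult:
    status: str
    reason: str
    message: str
    details: Any  # shape is not documented in the skill text, so keep it opaque


def parse_plan_close(stdout: str) -> PlanCloseResult:
    """Parse plan-close JSON output into its four named fields."""
    payload = json.loads(stdout)
    if not isinstance(payload, dict):
        raise ValueError("plan-close output is not a JSON object")
    required = ("status", "reason", "message", "details")
    missing = [key for key in required if key not in payload]
    if missing:
        raise ValueError(f"plan-close output is missing keys: {missing}")
    return PlanCloseResult(*(payload[key] for key in required))
```

Keeping `details` opaque avoids guessing at a schema; the `reason` and `message` fields are the ones most likely to be useful when deciding which earlier step (for example `quality-score` or `continuation-set`) to revisit.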
+ ## Frontend Design Docs
+
+ Harness-engine has no external design runtime dependency and must not call an external design skill during init. It uses the local `/Users/murphy/code/github/design.md` checkout only as a reference for document shape.
+
+ For frontend repositories, `docs/FRONTEND.md` records product positioning, requested style direction, existing frontend code signals, scope, stack notes, validation expectations, controlled files, and read order. `docs/DESIGN.md` records the unified visual specification with YAML tokens and markdown rationale. For backend-only repositories, these files are not generated.
+
+ ## Output Rules
+
+ - Keep `AGENTS.md` short and routing-oriented.
+ - Keep durable knowledge in repo docs, not in chat-only explanations.
+ - Keep plans under `docs/exec-plans/active/` and move finished plans to `docs/exec-plans/completed/`; plan Markdown and JSON sidecars are version-controlled project state.
+ - Keep resumable workstreams in `docs/exec-plans/workstreams.md`; this is version-controlled project state.
+ - Keep generated evidence under `docs/generated/`; it is local runtime output and is ignored by git unless the human intentionally promotes a specific artifact into tracked docs.
+ - Keep external, model-friendly references under `docs/references/`.
+ - Keep SOPs explicit and task-triggered so the next agent can follow the same path mechanically.
+
+ ## Assets
+
+ - Scaffold templates live under [assets/repo-template](assets/repo-template).
+ - SOP starter docs live under [assets/sops](assets/sops).