codex-harness-engineering 0.1.4
- package/AGENTS.md +73 -0
- package/README.md +136 -0
- package/docs/harness-engineering/implementation-playbook.md +370 -0
- package/docs/harness-engineering/index.md +61 -0
- package/docs/harness-engineering/research-note.md +318 -0
- package/docs/harness-engineering/sources.md +126 -0
- package/package.json +38 -0
- package/scripts/install-skills.mjs +104 -0
- package/scripts/publish.sh +139 -0
- package/scripts/verify-harness.mjs +175 -0
- package/skills/acceptance-contract/SKILL.md +78 -0
- package/skills/acceptance-contract/agents/openai.yaml +4 -0
- package/skills/cleanup-harness/SKILL.md +90 -0
- package/skills/cleanup-harness/agents/openai.yaml +4 -0
- package/skills/creator-harness/SKILL.md +124 -0
- package/skills/creator-harness/agents/openai.yaml +4 -0
- package/skills/creator-harness/references/harness-artifacts.md +302 -0
@@ -0,0 +1,302 @@

# Harness Artifact Templates

Use these templates selectively. Do not create every artifact by default.

Each artifact must answer at least one question:

- What should the agent know?
- What state survives context loss?
- What can the agent observe?
- How does the agent verify work?
- What constraint is mechanically enforced?

## Contents

- Minimal Repository Harness
- AGENTS.md
- progress.md
- feature_list.json
- init.sh
- Makefile
- Acceptance Contract
- Sprint Contract
- Evaluator Notes
- Legibility Map
- Cleanup Task

## Minimal Repository Harness

Start here unless a named failure mode requires more.

```text
AGENTS.md
README.md
progress.md
feature_list.json
init.sh
Makefile or task runner
tests/ or smoke test
```

Add these only when needed:

```text
docs/architecture.md
docs/product-spec.md
docs/tool-contracts.md
evals/
cleanup.md
```

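As a rough sketch, a new session could check which of the minimal files are still absent before creating anything. The function name `missing_harness_files` is hypothetical, and the list covers only the fixed file names from the minimal list above; the task runner and test entries are left out because their names vary by project.

```bash
#!/usr/bin/env bash

# Print each fixed-name minimal-harness file that is absent from a directory.
# Hypothetical helper: the name and the exact file list are assumptions.
missing_harness_files() {
  local dir="${1:-.}" name
  for name in AGENTS.md README.md progress.md feature_list.json init.sh; do
    if [ ! -e "$dir/$name" ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

Calling `missing_harness_files .` from the repository root prints one missing file name per line and prints nothing when all five exist.
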
## AGENTS.md

```markdown
# Agent Instructions

## Start Here
1. Read `README.md`.
2. Read latest entries in `progress.md`.
3. Check `feature_list.json`.
4. Run `./init.sh` or the standard setup command.
5. Run the cheapest smoke test before editing.

## Commands
- Setup:
- Test:
- Lint:
- Build:
- Smoke:

## Rules
- Keep changes scoped to the requested feature/fix.
- Update feature status only after verification passes.
- Record durable progress before ending a long session.
- Do not refactor unrelated code.
```

## progress.md

```markdown
# Progress

## YYYY-MM-DD

### Context
- Task:
- Current branch:
- Relevant files:

### Done
- ...

### Verification
- Command:
- Result:

### Open Issues
- ...

### Next
- ...
```

Keep entries short and recoverable. Prefer file paths, command names, failing
test names, and artifact paths over vague prose.

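One way to keep entries consistent is to append the skeleton with a small helper instead of typing it by hand. This is a minimal sketch, not part of the package: the function name `append_progress_entry` and its arguments are assumptions, and the section names are copied from the template above.

```bash
#!/usr/bin/env bash

# Append a dated, mostly empty entry to a progress file, creating the file
# if it does not exist. Hypothetical helper; adjust sections to your template.
append_progress_entry() {
  local file="${1:-progress.md}" task="${2:-}" branch
  # Best-effort branch lookup: records "unknown" on a detached HEAD,
  # outside a git repository, or when git is not installed.
  branch="$(git symbolic-ref --short -q HEAD 2>/dev/null || echo unknown)"
  if [ ! -f "$file" ]; then
    printf '%s\n' '# Progress' > "$file"
  fi
  {
    printf '\n%s\n\n' "## $(date +%F)"
    printf '%s\n' '### Context' "- Task: $task" "- Current branch: $branch" '- Relevant files:' ''
    printf '%s\n' '### Done' '- ...' ''
    printf '%s\n' '### Verification' '- Command:' '- Result:' ''
    printf '%s\n' '### Open Issues' '- ...' ''
    printf '%s\n' '### Next' '- ...'
  } >> "$file"
}
```

For example, `append_progress_entry progress.md "fix login redirect"` adds a new dated section at the end of the file and leaves earlier entries untouched.
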
## feature_list.json

```json
[
  {
    "id": "F001",
    "title": "Feature or capability",
    "status": "not_started",
    "acceptance": [
      "User can ...",
      "System rejects ...",
      "Regression check passes ..."
    ],
    "verify": [
      "make test",
      "make smoke"
    ],
    "evidence": []
  }
]
```

Use status values consistently: `not_started`, `in_progress`, `blocked`,
`verified`. Set `verified` only after every command listed in `verify` passes.

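The allowed status values can be checked mechanically. The sketch below is text-based and only assumes that each status appears as a `"status": "..."` pair; the function name `invalid_statuses` is hypothetical, and a JSON-aware tool would be more robust where one is available.

```bash
#!/usr/bin/env bash

# Print every status value in the file that is not one of the four allowed
# values. Prints nothing when all statuses are valid.
# Hypothetical helper: pattern matching on text, not a real JSON parser.
invalid_statuses() {
  local file="${1:-feature_list.json}"
  grep -oE '"status"[[:space:]]*:[[:space:]]*"[^"]*"' "$file" \
    | sed -E 's/.*"([^"]*)"$/\1/' \
    | grep -vxE 'not_started|in_progress|blocked|verified' || true
}
```

A lint target or `init.sh` could fail the run when `invalid_statuses feature_list.json` prints anything, which turns the naming rule into an enforced constraint.
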
## init.sh

```bash
#!/usr/bin/env bash
set -euo pipefail

# Keep this script idempotent. It should be safe for a new session to run first.
make setup
make smoke
```

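Idempotent here means a second run leaves the workspace as the first run left it. As an illustration only, a setup step might create a local file when it is missing and skip it otherwise; the function name `ensure_env_file` and the `.env.example` and `.env` file names are assumptions, not part of this template.

```bash
#!/usr/bin/env bash

# Hypothetical setup step: copy an example env file into place, but only
# when the target does not exist yet. Re-running never overwrites local edits.
ensure_env_file() {
  local example="${1:-.env.example}" target="${2:-.env}"
  if [ -e "$target" ]; then
    return 0   # already prepared; nothing to do
  fi
  if [ -f "$example" ]; then
    cp "$example" "$target"
  fi
}
```

The same guard-then-act shape applies to other setup work such as creating directories or seeding local data: check the current state first, then act only when something is missing.
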
## Makefile

```makefile
.PHONY: setup test lint build smoke verify

setup:
	# install dependencies or prepare local environment

test:
	# run unit tests

lint:
	# run lint or structural checks

build:
	# run build

smoke:
	# run the cheapest end-to-end confidence check

verify: lint test build smoke
```

Keep command names stable. Agent instructions should point to these targets
instead of repeating long command lines across files.

## Acceptance Contract

Use this for a small bug or feature when a separate planner and evaluator would
be too much.

```markdown
# Acceptance Contract

## Scope
- Feature/fix:
- User-visible behavior:
- Likely files:

## Acceptance Criteria
- [ ] ...
- [ ] ...

## Verification
- Unit:
- Integration:
- Browser/API:
- Log/metric/trace:

## Out of Scope
- ...
```

## Sprint Contract

Use this when work spans multiple files, runtime behavior, or subjective quality.

```markdown
# Sprint Contract

## Scope
- Feature:
- User path:
- API/data path:
- Likely files/modules:

## Done Means
- [ ] User can ...
- [ ] API or data reflects ...
- [ ] Error state handles ...
- [ ] No regression in ...

## Verification
- Unit:
- Integration:
- Browser/API:
- Log/metric/trace:

## Evaluator Focus
- Runtime behavior:
- Negative cases:
- UX or quality concerns:

## Out of Scope
- ...
```

If the sprint contract becomes longer than the work, split the work or fall back
to a smaller acceptance contract.

## Evaluator Notes

Use this when generator self-review is not enough.

```markdown
# Evaluator Notes

## Contract
- Sprint:
- Expected behavior:

## Checks Run
- Command/check:
- Result:
- Artifact:

## Findings
- [ ] P0/P1/P2:
  - Evidence:
  - Repro:
  - Suggested next step:

## Verdict
- pass/fail:
- Reason:
```

Evaluator feedback should cite observed evidence: screenshots, DOM state, API
response, database state, logs, traces, or command output.

## Legibility Map

Use this when the agent cannot see enough runtime behavior.

```markdown
# Legibility Map

| Area | Signal | How to collect | Owner/check |
| --- | --- | --- | --- |
| UI | Screenshot/DOM | | |
| API | Request/response | | |
| Backend runtime | Structured log/trace | | |
| Data | Schema/query/seed | | |
| Build | Build log/CI log | | |
| Architecture | Lint/structural test | | |
```

## Cleanup Task

Use this when agent throughput creates repeated drift.

```markdown
# Cleanup Task

## Trigger
- Repeated pattern:
- Evidence:

## Scope
- Include:
- Exclude:

## Acceptance Criteria
- [ ] ...

## Verification
- Lint:
- Test:
- Smoke:

## Rollback
- ...
```