npm - codex-harness-engineering - Versions diffs - 0.1.4 → 0.1.6 - Mend

codex-harness-engineering 0.1.4 → 0.1.6

This diff shows the content of publicly available package versions as released to a supported registry. It is provided for informational purposes only and reflects the changes between package versions as they appear in their public registries.

Files changed (34)

package/AGENTS.md +18 -6
package/LICENSE +21 -0
package/README.md +69 -6
package/docs/harness-engineering/implementation-playbook.md +232 -286
package/docs/harness-engineering/index.md +7 -4
package/docs/harness-engineering/research-note.md +294 -274
package/docs/harness-engineering/sources.md +166 -72
package/package.json +9 -4
package/scripts/install-skills.mjs +73 -15
package/scripts/publish.sh +2 -2
package/scripts/verify-harness.mjs +61 -4
package/skills/acceptance-contract/SKILL.md +39 -49
package/skills/acceptance-contract/agents/openai.yaml +2 -2
package/skills/cleanup-harness/SKILL.md +48 -59
package/skills/cleanup-harness/agents/openai.yaml +2 -2
package/skills/creator-harness/SKILL.md +79 -95
package/skills/creator-harness/agents/openai.yaml +2 -2
package/skills/creator-harness/references/harness-artifacts.md +63 -62
package/skills/lessons-harness/SKILL.md +68 -0
package/skills/lessons-harness/agents/openai.yaml +4 -0
package/templates/harness/AGENTS.md +77 -0
package/templates/harness/feature_list.json +16 -0
package/templates/harness/init.sh +15 -0
package/templates/harness/lessons.md +18 -0
package/templates/harness/memory/README.md +22 -0
package/templates/harness/progress.md +33 -0
package/templates/harness/rotate-state.mjs +131 -0
package/templates/harness/verify-state.mjs +117 -0
package/templates/team/roles/evaluator.md +43 -0
package/templates/team/roles/implementer.md +29 -0
package/templates/team/roles/planner.md +28 -0
package/templates/team/sprint-template.md +36 -0
package/templates/team/verify-team.mjs +71 -0
package/templates/team/workflow.md +62 -0

package/skills/acceptance-contract/SKILL.md CHANGED

@@ -1,50 +1,38 @@
 ---
 name: acceptance-contract
-description: Use when a user asks to define success criteria, clarify scope, prevent premature done claims, or prepare an AI agent/coding agent task before implementation.
+description: Dùng khi cần chốt tiêu chí "done", làm rõ phạm vi, ngăn agent tuyên bố hoàn thành sớm, hoặc chuẩn bị một task cho coding agent trước khi triển khai.
 ---
 # Acceptance Contract
-## Core Principle
+## Quy trình
-Turn an unclear request into a small, verifiable contract before implementation.
-Use this skill when "done" is ambiguous, the task could drift, or an agent may
-claim completion without evidence.
+1. Nêu các giả định trong một danh sách ngắn.
+2. Chỉ ra mọi điểm nhập nhằng làm thay đổi cách triển khai hoặc verification.
+3. Giữ phạm vi nhỏ hơn phần việc triển khai.
+4. Định nghĩa hành vi nhìn thấy được ở phía người dùng hoặc hệ thống.
+5. Định nghĩa tiêu chí nghiệm thu kiểm chứng được.
+6. Định nghĩa lệnh verification hoặc tín hiệu quan sát được.
+7. Đánh dấu non-goals để agent không mở rộng task.
+8. Chỉ triển khai sau khi contract đủ rõ để verify.
-In this repository, follow the local source policy: use only `[S1]-[S5]` for
-harness claims. Read `docs/harness-engineering/sources.md` only when you need to
-check that policy. For templates, prefer the relevant section of
-`docs/harness-engineering/implementation-playbook.md` instead of loading the
-whole research note.
+Nếu thông tin còn thiếu không thể suy luận an toàn, hãy hỏi một câu ngắn gọn
+trước khi viết code.
-## Workflow
-1. State assumptions in one short list.
-2. Name any ambiguity that changes implementation or verification.
-3. Keep the scope smaller than the implementation work.
-4. Define user-visible or system-visible behavior.
-5. Define acceptance criteria that can be checked.
-6. Define verification commands or observable signals.
-7. Mark non-goals so the agent does not widen the task.
-8. Implement only after the contract is clear enough to verify.
-If the missing information cannot be inferred safely, ask one concise question
-before writing code.
-## Contract Template
+## Mẫu Contract
 ```markdown
 # Acceptance Contract
-## Assumptions
+## Giả định
 - ...
-## Scope
+## Phạm vi
 - Feature/fix:
-- User-visible behavior:
-- Likely files:
+- Hành vi nhìn thấy phía người dùng:
+- File có khả năng đụng đến:
-## Acceptance Criteria
+## Tiêu chí nghiệm thu
 - [ ] ...
 - [ ] ...
@@ -54,25 +42,27 @@ before writing code.
 - Browser/API:
 - Log/metric/trace:
-## Out of Scope
+## Ngoài phạm vi
 - ...
 ```
-## Verification Rules
-- Prefer an existing project command over a new script.
-- For code changes, run the narrowest test that proves the criteria.
-- For UI/runtime behavior, use browser, API, log, metric, trace, or screenshot
-  evidence when available.
-- Do not mark criteria done until verification has run or the skipped check is
-  explicitly explained.
-## Source Mapping
-- Small tasks should use the simplest sufficient workflow [S3].
-- Long-running agent tasks need state and verification to avoid early done
-  claims [S2].
-- Runtime-visible checks improve agent feedback loops [S1], [S2], [S4].
-- Sprint contracts and evaluator criteria help when task quality is subjective
-  or multi-step [S4].
-- Trajectory evaluation and LLM-as-a-judge monitor execution path quality, and AutoHarness enforces constraints when manual rules are too complex [S5].
+## Quy tắc verification
+- Ưu tiên một lệnh sẵn có của dự án hơn là viết script mới.
+- Với thay đổi code, chạy bài test hẹp nhất chứng minh được tiêu chí.
+- Với hành vi UI/runtime, dùng bằng chứng browser, API, log, metric, trace,
+  hoặc screenshot khi có.
+- Không đánh dấu tiêu chí là done cho tới khi verification đã chạy, hoặc nêu rõ
+  lý do nếu một kiểm tra bị bỏ qua.
+## Ánh xạ nguồn
+- Task nhỏ nên dùng quy trình đơn giản đủ dùng nhất [S3].
+- Task agent chạy dài cần state và verification để tránh tuyên bố "done" sớm
+  [S2].
+- Các kiểm tra nhìn thấy được lúc runtime cải thiện feedback loop của agent
+  [S1], [S2], [S4].
+- Sprint contract và tiêu chí của evaluator giúp ích khi chất lượng task mang
+  tính chủ quan hoặc nhiều bước [S4].
+- AutoHarness tổng hợp một code wrapper để cưỡng chế ràng buộc khi quy tắc thủ
+  công quá phức tạp [S5].
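
The verification rule in the new `SKILL.md` (roughly: do not mark a criterion done until verification has run, or state why a check was skipped) could be backed by a small mechanical check. The following is a minimal sketch, not part of the package: it assumes the contract is saved as Markdown using the `- [ ]` / `- [x]` checkbox convention and the `## Tiêu chí nghiệm thu` ("Acceptance criteria") heading from the template. The function name `unchecked_criteria` and the sample contract are hypothetical.

```python
import re

# Matches a Markdown task-list item such as "- [ ] text" or "- [x] text".
CHECKBOX = re.compile(r"^- \[( |x|X)\] (.*)$")


def unchecked_criteria(contract_md: str, heading: str = "Tiêu chí nghiệm thu") -> list[str]:
    """Return the text of every unchecked item under the given '## ' heading.

    Hypothetical helper: the heading default mirrors the template in the diff,
    but nothing in the package is known to parse contracts this way.
    """
    pending = []
    in_section = False
    for line in contract_md.splitlines():
        if line.startswith("## "):
            # A new second-level heading opens or closes the section of interest.
            in_section = line[3:].strip() == heading
            continue
        match = CHECKBOX.match(line.strip())
        if in_section and match and match.group(1) == " ":
            pending.append(match.group(2).strip())
    return pending


# Sample contract (invented for illustration only).
SAMPLE = """\
# Acceptance Contract
## Giả định
- API already exists
## Tiêu chí nghiệm thu
- [x] Login returns 200 for valid credentials
- [ ] Login returns 401 for invalid credentials
## Ngoài phạm vi
- [ ] Password reset
"""

if __name__ == "__main__":
    print(unchecked_criteria(SAMPLE))
```

An agent or CI step could refuse to report "done" while this list is non-empty. Items under other headings, such as the out-of-scope section, are deliberately ignored.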

package/skills/acceptance-contract/agents/openai.yaml CHANGED

@@ -1,4 +1,4 @@
 interface:
   display_name: "Acceptance Contract"
-  short_description: "Define scope, done criteria, and checks"
-  default_prompt: "Use $acceptance-contract to define scope, acceptance criteria, and verification for this task."
+  short_description: "Chốt phạm vi, tiêu chí done, và kiểm tra"
+  default_prompt: "Dùng $acceptance-contract để chốt phạm vi, tiêu chí nghiệm thu, và verification cho task này."

package/skills/cleanup-harness/SKILL.md CHANGED

@@ -1,90 +1,79 @@
 ---
 name: cleanup-harness
-description: Use when a user asks to design, scope, or run cleanup for agent-created code, documentation drift, repeated review defects, architecture drift, or accumulated harness debt.
+description: Dùng khi cần thiết kế, chốt phạm vi, hoặc thực hiện cleanup cho code do agent tạo, tài liệu trôi (drift), defect review lặp lại, kiến trúc trôi, hoặc nợ harness tích tụ.
 ---
 # Cleanup Harness
-## Core Principle
+## Điều kiện kích hoạt
-Treat cleanup as a scoped harness task, not opportunistic refactoring. Cleanup
-needs a trigger, acceptance criteria, verification, and rollback path because
-high agent throughput can spread weak patterns quickly.
+Chỉ bắt đầu một task cleanup khi thấy ít nhất một trigger:
-In this repository, follow the local source policy: use only `[S1]-[S5]` for
-harness claims. Read `docs/harness-engineering/sources.md` only when you need to
-check that policy. For cleanup templates, prefer the relevant section of
-`docs/harness-engineering/implementation-playbook.md` instead of loading the
-whole research note.
+- cùng một helper, workaround, hoặc pattern xuất hiện lặp lại;
+- một feature đi vòng qua ranh giới kiến trúc;
+- log tiến độ lặp lại cùng một lỗi;
+- feedback từ evaluator hoặc review bắt cùng một lớp defect nhiều lần;
+- docs, index, hoặc `AGENTS.md` trôi khỏi trạng thái thực của repo;
+- công việc mới thêm code workaround thay vì sửa nguyên nhân gốc.
-## Cleanup Triggers
+Nếu không thấy trigger nào, hãy nêu vấn đề tiềm năng nhưng đừng sửa code không
+liên quan.
-Start a cleanup task only when at least one trigger is visible:
+## Quy trình
-- the same helper, workaround, or pattern appears repeatedly;
-- a feature bypasses an architecture boundary;
-- progress logs repeat the same failure;
-- evaluator or review feedback catches the same defect class multiple times;
-- docs, indexes, or `AGENTS.md` drift from the repository state;
-- new work adds workaround code instead of fixing the cause.
+1. Xác định trigger cụ thể và bằng chứng.
+2. Chốt phạm vi cleanup nhỏ nhất loại bỏ được vấn đề lặp lại.
+3. Liệt kê các file có khả năng thay đổi.
+4. Định nghĩa tiêu chí nghiệm thu.
+5. Định nghĩa lệnh verification hoặc tín hiệu quan sát được.
+6. Chỉ dọn nợ nằm trong phạm vi đã khai báo.
+7. Chuyển phán đoán lặp lại thành một guardrail cơ học khi khả thi.
+8. Ghi lại những gì đã verify và rủi ro còn lại.
-If no trigger is visible, mention the potential issue but do not edit unrelated
-code.
-## Workflow
-1. Identify the concrete trigger and evidence.
-2. Define the smallest cleanup scope that removes the repeated problem.
-3. List files likely to change.
-4. Define acceptance criteria.
-5. Define verification commands or observable signals.
-6. Remove only debt inside the declared scope.
-7. Convert repeated judgment into a mechanical guardrail when practical.
-8. Record what was verified and any residual risk.
-## Cleanup Task Template
+## Mẫu Cleanup Task
 ```markdown
 # Cleanup Task
 ## Trigger
-- Evidence:
+- Bằng chứng:
-## Scope
-- Clean up:
-- Likely files:
+## Phạm vi
+- Cần dọn:
+- File có khả năng đụng đến:
-## Acceptance Criteria
-- [ ] Duplicate or drift source is removed.
-- [ ] Behavior remains unchanged unless explicitly requested.
-- [ ] Guardrail is added or the reason for not adding one is stated.
+## Tiêu chí nghiệm thu
+- [ ] Nguồn gây trùng lặp hoặc drift đã được loại bỏ.
+- [ ] Hành vi giữ nguyên trừ khi được yêu cầu rõ ràng.
+- [ ] Đã thêm guardrail, hoặc nêu rõ lý do không thêm.
 ## Verification
 - Tests:
-- Lint/structural check:
-- Runtime check:
+- Kiểm tra lint/cấu trúc:
+- Kiểm tra runtime:
 ## Rollback
-- Safe restore point:
+- Điểm khôi phục an toàn:
 ```
-## Guardrail Guidance
+## Hướng dẫn guardrail
-Prefer a mechanical check when the same issue is likely to recur:
+Ưu tiên một kiểm tra cơ học khi vấn đề có khả năng tái diễn:
-- lint or structural test for architecture boundaries;
-- doc/index freshness check for repository source of truth;
-- smoke test for setup or runtime drift;
-- evaluator rubric for repeated subjective quality failures.
+- lint hoặc structural test cho ranh giới kiến trúc;
+- kiểm tra độ tươi của doc/index cho nguồn chân lý của repo;
+- smoke test cho drift trong setup hoặc runtime;
+- rubric evaluator cho lỗi chất lượng chủ quan lặp lại.
-Do not add broad rules that protect no concrete invariant.
+Không thêm quy tắc rộng mà không bảo vệ một invariant cụ thể nào.
-## Source Mapping
+## Ánh xạ nguồn
-- Cleanup is part of repository-level harness maintenance when throughput
-  increases entropy [S1].
-- Mechanical guardrails are stronger than prose for repeated invariants [S1].
-- Keep the intervention as simple as the failure mode allows [S3].
-- Long-running work benefits from explicit state, verification, and recovery
-  points [S2], [S4].
-- AutoHarness can automatically enforce code constraints to reduce cleanup debt, and trajectory evaluation tracks whether cleanup alters agent execution paths [S5].
+- Cleanup là một phần của bảo trì harness cấp repo khi throughput làm tăng
+  entropy [S1].
+- Guardrail cơ học mạnh hơn văn xuôi cho các invariant lặp lại [S1].
+- Giữ can thiệp đơn giản đúng mức mà failure mode cho phép [S3].
+- Công việc chạy dài hưởng lợi từ state, verification, và điểm khôi phục tường
+  minh [S2], [S4].
+- AutoHarness có thể tự tổng hợp ràng buộc dạng code để giảm nợ cleanup khi quy
+  tắc thủ công quá phức tạp [S5].
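
One of the cleanup triggers listed in this file is a progress log that repeats the same failure. As a rough illustration of turning that trigger into evidence, the sketch below counts recurring failure lines in a log. It assumes a hypothetical log convention in which failures are written on lines starting with `FAIL:`; the package's actual `progress.md` format is not shown in this diff, and `repeated_failures` is an invented name.

```python
from collections import Counter


def repeated_failures(log_text: str, threshold: int = 2, prefix: str = "FAIL:") -> dict[str, int]:
    """Return failure messages that occur at least `threshold` times.

    Assumption: failures are logged one per line with a fixed prefix.
    Messages are compared case-insensitively after trimming whitespace.
    """
    counts = Counter(
        line[len(prefix):].strip().lower()
        for line in map(str.strip, log_text.splitlines())
        if line.startswith(prefix)
    )
    return {message: n for message, n in counts.items() if n >= threshold}


# Invented log excerpt for illustration only.
LOG = """\
session 1
FAIL: smoke test times out on /health
session 2
FAIL: lint error in utils.py
FAIL: Smoke test times out on /health
"""

if __name__ == "__main__":
    print(repeated_failures(LOG))
```

A non-empty result would count as visible trigger evidence for the `Trigger` section of the cleanup task template. An empty result matches the rule to mention the potential issue without editing unrelated code.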

package/skills/cleanup-harness/agents/openai.yaml CHANGED

@@ -1,4 +1,4 @@
 interface:
   display_name: "Cleanup Harness"
-  short_description: "Scope cleanup with triggers and checks"
-  default_prompt: "Use $cleanup-harness to scope a cleanup task with trigger evidence, acceptance criteria, and verification."
+  short_description: "Chốt phạm vi cleanup theo trigger và kiểm tra"
+  default_prompt: "Dùng $cleanup-harness để chốt phạm vi một task cleanup với bằng chứng trigger, tiêu chí nghiệm thu, và verification."

package/skills/creator-harness/SKILL.md CHANGED

@@ -1,108 +1,92 @@
 ---
 name: creator-harness
-description: Use when a user asks to create, design, audit, or improve a harness for AI agents, coding agents, long-running work, eval loops, repository workflows, or agent operating procedures.
+description: Dùng khi cần tạo, thiết kế, audit, hoặc cải thiện một harness cho AI agent, coding agent, công việc chạy dài, eval loop, workflow repository, hoặc quy trình vận hành của agent.
 ---
 # Creator Harness
-## Core Principle
-Create the smallest harness that changes agent behavior. A harness is the
-control plane around an agent: durable state, readable tools, verification
-loops, evaluator feedback when needed, and mechanical guardrails.
-Use only the local five-source research as the source of truth:
-- `docs/harness-engineering/sources.md`
-- `docs/harness-engineering/research-note.md`
-- `docs/harness-engineering/implementation-playbook.md`
-Do not introduce external harness resources unless the user explicitly asks to
-expand beyond the five OpenAI/Anthropic/Google articles.
-## Working Rules
-1. State assumptions before creating files. If the target agent, runtime, or
-   success criteria are unknowable, ask one concise question.
-2. Start with a single-agent harness plus state and verification. Add planner,
-   evaluator, telemetry, or cleanup automation only when a named failure mode
-   requires it.
-3. Touch only harness artifacts unless the user explicitly asks for product code
-   changes.
-4. Every harness artifact must answer at least one question: What should the
-   agent know? What state survives context loss? What can it observe? How does
-   it verify? What invariant is mechanically enforced?
-5. Convert important preferences into checks where practical: tests, lint,
-   scripts, CI jobs, evaluator rubrics, or reviewer contracts.
-6. For one-shot Markdown or research-note edits in this repository, do not start
-   autonomous loops unless the user explicitly requests them.
-## Design Workflow
-1. Inventory existing harness surface:
-   - `AGENTS.md`, `README.md`, architecture docs, product specs;
-   - setup scripts, task runner, CI, tests, smoke tests;
-   - progress logs, feature lists, todos, research state;
-   - eval prompts, evaluator rubrics, screenshots, traces, telemetry;
-   - tool contracts, permissions, escalation rules.
-2. Name the failure modes:
-   - lost context across sessions;
-   - early "done" claims;
-   - weak runtime observability;
-   - overbroad implementation;
-   - self-evaluation optimism;
-   - architecture drift;
-   - cleanup debt from high agent throughput.
-3. Pick the minimal intervention:
-   - unclear task: acceptance contract;
-   - lost context: `progress.md`, `feature_list.json`, git protocol;
-   - broken environment: `init.sh`, smoke test;
-   - invisible runtime: browser/API/log/metric/trace checks;
-   - weak self-review: evaluator rubric or separate evaluator pass;
-   - drift: structural lint or architecture test;
-   - throughput entropy: targeted cleanup task with verification;
-   - complex constraints: AutoHarness synthesized code wrapper [S5];
-   - agent trajectory drift: Trajectory Evaluation and LLM-as-a-judge [S5].
-4. Write a harness contract:
-   - agent role and allowed scope;
-   - durable state files;
-   - required tools and observable signals;
-   - verification commands;
-   - loop cadence;
-   - stop/escalation conditions;
-   - out-of-scope work.
-5. Create only the needed files. For templates, read
+## Quy tắc làm việc
+1. Nêu giả định trước khi tạo file. Nếu không thể biết target agent, runtime,
+   hoặc tiêu chí thành công, hãy hỏi một câu ngắn gọn.
+2. Bắt đầu với harness single-agent kèm state và verification. Chỉ thêm planner,
+   evaluator, telemetry, hoặc cleanup automation khi một failure mode có tên đòi
+   hỏi.
+3. Chỉ đụng đến artifact harness trừ khi người dùng yêu cầu rõ ràng thay đổi
+   code sản phẩm.
+4. Mỗi artifact harness phải trả lời ít nhất một câu hỏi: Agent cần biết gì?
+   State nào sống sót qua mất context? Agent quan sát được gì? Verify thế nào?
+   Invariant nào được cưỡng chế cơ học?
+5. Chuyển các preference quan trọng thành kiểm tra khi khả thi: test, lint,
+   script, CI job, rubric evaluator, hoặc reviewer contract.
+6. Với chỉnh sửa Markdown hoặc research note một lần trong repo này, không khởi
+   động autonomous loop trừ khi người dùng yêu cầu rõ ràng.
+## Quy trình thiết kế
+1. Kiểm kê bề mặt harness hiện có:
+   - `AGENTS.md`, `README.md`, docs kiến trúc, product spec;
+   - setup script, task runner, CI, test, smoke test;
+   - log tiến độ, feature list, todo, research state;
+   - eval prompt, rubric evaluator, screenshot, trace, telemetry;
+   - tool contract, permission, quy tắc escalation.
+2. Đặt tên cho các failure mode:
+   - mất context qua các session;
+   - tuyên bố "done" sớm;
+   - khả năng quan sát runtime yếu;
+   - triển khai quá rộng;
+   - lạc quan khi tự đánh giá;
+   - kiến trúc trôi;
+   - nợ cleanup do throughput cao của agent.
+3. Chọn can thiệp tối thiểu:
+   - task chưa rõ: acceptance contract;
+   - mất context: `progress.md`, `feature_list.json`, git protocol;
+   - môi trường hỏng: `init.sh`, smoke test;
+   - runtime không nhìn thấy: kiểm tra browser/API/log/metric/trace;
+   - tự review yếu: rubric evaluator hoặc một lượt evaluator riêng;
+   - drift: structural lint hoặc architecture test;
+   - entropy do throughput: task cleanup có phạm vi kèm verification;
+   - ràng buộc phức tạp: code wrapper tổng hợp bởi AutoHarness [S5].
+4. Viết một harness contract:
+   - vai trò agent và phạm vi cho phép;
+   - các file state bền vững;
+   - tool cần dùng và tín hiệu quan sát được;
+   - lệnh verification;
+   - nhịp loop;
+   - điều kiện stop/escalation;
+   - công việc ngoài phạm vi.
+5. Chỉ tạo những file cần thiết. Khi cần mẫu, đọc
    `references/harness-artifacts.md`.
-6. Verify the harness:
-   - run syntax/format validators for files created;
-   - run the declared smoke test if one exists;
-   - run the placeholder and citation scan from `AGENTS.md`;
-   - verify no recurring automation was created for a one-shot documentation
-     task;
-   - if editing this skill, validate the skill if a validator exists locally.
+6. Verify harness:
+   - chạy validator cú pháp/định dạng cho các file đã tạo;
+   - chạy smoke test đã khai báo nếu có;
+   - chạy lượt quét placeholder và citation theo `AGENTS.md`;
+   - kiểm tra không có automation định kỳ nào bị tạo cho một task tài liệu một
+     lần;
+   - nếu chỉnh sửa chính skill này, validate skill nếu có validator cục bộ.
-## Harness Types
+## Các loại harness
-| Situation                   | Default harness                                           |
-| --------------------------- | --------------------------------------------------------- |
-| Small bug or feature        | Acceptance criteria and a verification command            |
-| Multi-session coding        | `progress.md`, `feature_list.json`, `init.sh`, smoke test |
-| UI/runtime-heavy app        | Sprint contract, browser/API checks, evaluator notes      |
-| Long application build      | Planner, generator, evaluator, sprint contract            |
-| Architecture-sensitive repo | Dependency rules, structural tests, cleanup cadence       |
-| Complex or rule-heavy env   | AutoHarness (wrapper), Trajectory evaluation / VeRO       |
+| Tình huống                   | Harness mặc định                                          |
+| ---------------------------- | --------------------------------------------------------- |
+| Bug/feature nhỏ              | Tiêu chí nghiệm thu và một lệnh verification              |
+| Code nhiều session           | `progress.md`, `feature_list.json`, `init.sh`, smoke test |
+| App nặng UI/runtime          | Sprint contract, kiểm tra browser/API, ghi chú evaluator  |
+| Build ứng dụng dài           | Planner, generator, evaluator, sprint contract            |
+| Repo nhạy kiến trúc          | Dependency rule, structural test, nhịp cleanup            |
+| Môi trường phức tạp/nhiều luật | Code wrapper tổng hợp bởi AutoHarness                     |
-## Output Shape
+## Hình thức đầu ra
-When answering without file edits, produce:
+Khi trả lời mà không sửa file, hãy tạo:
 ```markdown
-## Assumptions
+## Giả định
 - ...
@@ -110,15 +94,15 @@ When answering without file edits, produce:
 - ...
-## Minimal Harness
+## Harness tối thiểu
 - Artifact:
-- Purpose:
+- Mục đích:
 - Verification:
-## Next Step
+## Bước tiếp theo
 - ...
 ```
-When editing files, summarize changed files and verification run.
+Khi sửa file, tóm tắt các file đã đổi và verification đã chạy.
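
Step 3 of the design workflow in the 0.1.6 `SKILL.md` pairs each named failure mode with a minimal intervention. That pairing can be read as a lookup table. The sketch below encodes the eight pairs present in the new version, with the keys rendered in English from the removed 0.1.4 wording. The names `MINIMAL_INTERVENTION` and `pick_interventions` are hypothetical, and the package itself is not known to ship such a table.

```python
# Failure mode -> minimal intervention, following step 3 of the 0.1.6 workflow.
# The trajectory-drift entry from 0.1.4 is omitted because 0.1.6 removes it.
MINIMAL_INTERVENTION = {
    "unclear task": ["acceptance contract"],
    "lost context": ["progress.md", "feature_list.json", "git protocol"],
    "broken environment": ["init.sh", "smoke test"],
    "invisible runtime": ["browser/API/log/metric/trace checks"],
    "weak self-review": ["evaluator rubric or separate evaluator pass"],
    "drift": ["structural lint or architecture test"],
    "throughput entropy": ["targeted cleanup task with verification"],
    "complex constraints": ["AutoHarness synthesized code wrapper [S5]"],
}


def pick_interventions(failure_modes: list[str]) -> dict[str, list[str]]:
    """Map each named failure mode to its minimal intervention.

    Raises ValueError for a mode that is not in the table, reflecting the
    rule that additions need a *named* failure mode to justify them.
    """
    unknown = [mode for mode in failure_modes if mode not in MINIMAL_INTERVENTION]
    if unknown:
        raise ValueError(f"unnamed failure modes: {unknown}")
    return {mode: MINIMAL_INTERVENTION[mode] for mode in failure_modes}


if __name__ == "__main__":
    print(pick_interventions(["lost context", "drift"]))
```

Keeping the table explicit makes it harder to add a planner, evaluator, or telemetry without first naming the failure mode that requires it.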

package/skills/creator-harness/agents/openai.yaml CHANGED

@@ -1,4 +1,4 @@
 interface:
   display_name: "Creator Harness"
-  short_description: "Design practical agent harnesses"
-  default_prompt: "Use $creator-harness to design a minimal harness for this repository."
+  short_description: "Thiết kế harness thực dụng cho agent"
+  default_prompt: "Dùng $creator-harness để thiết kế một harness tối thiểu cho repository này."