@codexstar/bug-hunter 3.0.0 → 3.0.5
- package/CHANGELOG.md +149 -83
- package/README.md +150 -15
- package/SKILL.md +94 -27
- package/agents/openai.yaml +4 -0
- package/bin/bug-hunter +9 -3
- package/docs/images/2026-03-12-fix-plan-rollout.png +0 -0
- package/docs/images/2026-03-12-hero-bug-hunter-overview.png +0 -0
- package/docs/images/2026-03-12-machine-readable-artifacts.png +0 -0
- package/docs/images/2026-03-12-pr-review-flow.png +0 -0
- package/docs/images/2026-03-12-security-pack.png +0 -0
- package/docs/images/adversarial-debate.png +0 -0
- package/docs/images/doc-verify-fix-plan.png +0 -0
- package/docs/images/hero.png +0 -0
- package/docs/images/pipeline-overview.png +0 -0
- package/docs/images/security-finding-card.png +0 -0
- package/docs/plans/2026-03-11-structured-output-migration-plan.md +288 -0
- package/docs/plans/2026-03-12-audit-bug-fixes-surgical-plan.md +193 -0
- package/docs/plans/2026-03-12-enterprise-security-pack-e2e-plan.md +59 -0
- package/docs/plans/2026-03-12-local-security-skills-integration-plan.md +39 -0
- package/docs/plans/2026-03-12-pr-review-strategic-fix-flow.md +78 -0
- package/evals/evals.json +366 -102
- package/modes/extended.md +2 -2
- package/modes/fix-loop.md +30 -30
- package/modes/fix-pipeline.md +32 -6
- package/modes/large-codebase.md +14 -15
- package/modes/local-sequential.md +44 -20
- package/modes/loop.md +56 -56
- package/modes/parallel.md +3 -3
- package/modes/scaled.md +2 -2
- package/modes/single-file.md +3 -3
- package/modes/small.md +11 -11
- package/package.json +10 -1
- package/prompts/fixer.md +37 -23
- package/prompts/hunter.md +39 -20
- package/prompts/referee.md +34 -20
- package/prompts/skeptic.md +25 -22
- package/schemas/coverage.schema.json +67 -0
- package/schemas/examples/findings.invalid.json +13 -0
- package/schemas/examples/findings.valid.json +17 -0
- package/schemas/findings.schema.json +76 -0
- package/schemas/fix-plan.schema.json +94 -0
- package/schemas/fix-report.schema.json +105 -0
- package/schemas/fix-strategy.schema.json +99 -0
- package/schemas/recon.schema.json +31 -0
- package/schemas/referee.schema.json +46 -0
- package/schemas/shared.schema.json +51 -0
- package/schemas/skeptic.schema.json +21 -0
- package/scripts/bug-hunter-state.cjs +35 -12
- package/scripts/code-index.cjs +11 -4
- package/scripts/fix-lock.cjs +95 -25
- package/scripts/payload-guard.cjs +24 -10
- package/scripts/pr-scope.cjs +181 -0
- package/scripts/render-report.cjs +346 -0
- package/scripts/run-bug-hunter.cjs +667 -32
- package/scripts/schema-runtime.cjs +273 -0
- package/scripts/schema-validate.cjs +40 -0
- package/scripts/tests/bug-hunter-state.test.cjs +68 -3
- package/scripts/tests/code-index.test.cjs +15 -0
- package/scripts/tests/fix-lock.test.cjs +60 -2
- package/scripts/tests/fixtures/flaky-worker.cjs +6 -1
- package/scripts/tests/fixtures/low-confidence-worker.cjs +8 -2
- package/scripts/tests/fixtures/success-worker.cjs +6 -1
- package/scripts/tests/payload-guard.test.cjs +154 -2
- package/scripts/tests/pr-scope.test.cjs +212 -0
- package/scripts/tests/render-report.test.cjs +180 -0
- package/scripts/tests/run-bug-hunter.test.cjs +686 -2
- package/scripts/tests/security-skills-integration.test.cjs +29 -0
- package/scripts/tests/skills-packaging.test.cjs +30 -0
- package/scripts/tests/worktree-harvest.test.cjs +66 -0
- package/scripts/worktree-harvest.cjs +62 -9
- package/skills/README.md +19 -0
- package/skills/commit-security-scan/SKILL.md +63 -0
- package/skills/security-review/SKILL.md +57 -0
- package/skills/threat-model-generation/SKILL.md +47 -0
- package/skills/vulnerability-validation/SKILL.md +59 -0
- package/templates/subagent-wrapper.md +12 -3
- package/modes/_dispatch.md +0 -121
|
@@ -0,0 +1,288 @@
|
|
|
1
|
+
# Canonical Structured Outputs For Bug Hunter
|
|
2
|
+
|
|
3
|
+
This ExecPlan is a living document. The sections `Progress`, `Surprises & Discoveries`, `Decision Log`, and `Outcomes & Retrospective` must be kept up to date as work proceeds.
|
|
4
|
+
|
|
5
|
+
This repository does not contain a checked-in `PLANS.md`, but this document is written to the same standard as the machine-local ExecPlan reference at `/Users/codex/Downloads/Code Files/PLANS.md`. Keep this plan self-contained as implementation proceeds.
|
|
6
|
+
|
|
7
|
+
## Purpose / Big Picture
|
|
8
|
+
|
|
9
|
+
After this change, Bug Hunter will use one canonical structured contract from end to end. Each phase will emit validated JSON as the source of truth, while Markdown becomes a rendered report for humans. This matters because the current system mixes Markdown prompts, ad hoc parsing, and JSON side channels, which makes the pipeline slower, harder to validate, and more likely to drift into false positives, silent false negatives, or broken fix eligibility.
|
|
10
|
+
|
|
11
|
+
The user-visible result is simple to verify. A bug-hunter run should create phase artifacts such as `.bug-hunter/recon.json`, `.bug-hunter/findings.json`, `.bug-hunter/skeptic.json`, `.bug-hunter/referee.json`, `.bug-hunter/coverage.json`, and `.bug-hunter/fix-report.json`. The same run should still produce readable Markdown reports, but those Markdown files must be generated from the JSON artifacts rather than being the only source of truth. A failed or malformed phase output should be rejected immediately with a precise validation error and a retry path instead of slipping through as an empty or partially parsed report.
|
|
12
|
+
|
|
13
|
+
## Progress
|
|
14
|
+
|
|
15
|
+
- [x] (2026-03-11 18:40Z) Create versioned JSON schemas for `recon`, `findings`, `skeptic`, `referee`, `coverage`, `fix-report`, plus shared definitions under `schemas/`.
|
|
16
|
+
- [x] (2026-03-11 18:40Z) Add `scripts/schema-runtime.cjs` and `scripts/schema-validate.cjs`, ship `schemas/` in the npm package, and add example valid/invalid `findings.json` fixtures.
|
|
17
|
+
- [x] (2026-03-11 18:40Z) Wire strict findings validation into `payload-guard.cjs`, `bug-hunter-state.cjs`, and `run-bug-hunter.cjs`, including retry-on-invalid-findings inside the chunk worker loop.
|
|
18
|
+
- [x] (2026-03-11 20:05Z) Replace Markdown-only phase prompting with JSON-first prompting plus rendered Markdown output guidance, including `scripts/render-report.cjs`.
|
|
19
|
+
- [x] (2026-03-11 20:05Z) Normalize confidence to numeric values in canonical findings/referee contracts and fix-plan eligibility.
|
|
20
|
+
- [x] (2026-03-11 20:05Z) Replace `coverage.md` as canonical loop state with `coverage.json` and keep `coverage.md` as a derived summary.
|
|
21
|
+
- [x] (2026-03-11 21:06Z) Add strict inbound and outbound validation, retry logic, and eval coverage for malformed outputs and stale contracts.
|
|
22
|
+
- [x] (2026-03-11 20:05Z) Update core documentation, mode docs, wrapper templates, and eval text so they match the full-queue loop semantics and the new structured contracts.
|
|
23
|
+
|
|
24
|
+
## Surprises & Discoveries
|
|
25
|
+
|
|
26
|
+
- Observation: the orchestrator already has a JSON worker path, but the main prompts still tell agents to write Markdown reports.
|
|
27
|
+
Evidence: `scripts/run-bug-hunter.cjs` writes and reads `chunk-<id>-findings.json`, while `prompts/hunter.md` still directs output to `.bug-hunter/findings.md`.
|
|
28
|
+
|
|
29
|
+
- Observation: fix planning expects numeric confidence, but the Referee prompt still emits `High/Medium/Low`.
|
|
30
|
+
Evidence: `scripts/run-bug-hunter.cjs` filters fix eligibility with `confidence >= confidenceThreshold`, while `prompts/referee.md` asks for `Confidence: High/Medium/Low`.
|
|
31
|
+
|
|
32
|
+
- Observation: loop state is still a machine-parseable Markdown document, which is more brittle than the rest of the JSON-capable pipeline.
|
|
33
|
+
Evidence: `modes/loop.md` defines `.bug-hunter/coverage.md` with line-based sections and a checksum format instead of a JSON state file.
|
|
34
|
+
|
|
35
|
+
- Observation: evaluation fixtures still encode the earlier `CRITICAL/HIGH` stopping rule.
|
|
36
|
+
Evidence: `evals/evals.json` case `id: 6` still expects completion once all CRITICAL and HIGH files are done.
|
|
37
|
+
|
|
38
|
+
- Observation: once schema refs become real runtime assets, isolated skill copies must include `schemas/` as well as `scripts/`.
|
|
39
|
+
Evidence: the preflight isolation test needed `schemas/findings.schema.json` and the new schema helper scripts copied into the sandbox to stay representative.
|
|
40
|
+
|
|
41
|
+
- Observation: deduplicated findings now inherit the strongest numeric confidence for the shared `file|lines|claim` key, which changes low-confidence metrics compared with the previous loose merge.
|
|
42
|
+
Evidence: `scripts/bug-hunter-state.cjs` now validates findings before merge and keeps the maximum `confidenceScore` for duplicate keys, which required updating the state test expectation.
|
|
43
|
+
|
|
44
|
+
- Observation: the remaining validation gap closed cleanly once the runner exposed a generic schema-validated phase command instead of baking phase-specific logic into docs.
|
|
45
|
+
Evidence: `scripts/run-bug-hunter.cjs` now exposes `phase`, validates any named artifact after each attempt, and retries malformed Skeptic/Referee/Fix outputs before the phase succeeds.
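
The deduplication behavior recorded in this list (duplicates on the shared `file|lines|claim` key inherit the strongest numeric confidence) might look roughly like the sketch below. It is not the code in `scripts/bug-hunter-state.cjs`; the function names `findingKey` and `mergeFindings` and the sample data are hypothetical.

```javascript
// Hypothetical sketch: merge findings by a `file|lines|claim` key and keep
// the entry with the highest confidenceScore for each key.
function findingKey(finding) {
  return [finding.file, finding.lines, finding.claim].join("|");
}

function mergeFindings(existing, incoming) {
  const byKey = new Map();
  for (const finding of [...existing, ...incoming]) {
    const key = findingKey(finding);
    const current = byKey.get(key);
    // Replace the stored entry only when the new one is strictly stronger.
    if (!current || finding.confidenceScore > current.confidenceScore) {
      byKey.set(key, finding);
    }
  }
  return [...byKey.values()];
}

// Illustrative data only.
const merged = mergeFindings(
  [{ file: "src/a.js", lines: "10-12", claim: "null deref", confidenceScore: 55 }],
  [
    { file: "src/a.js", lines: "10-12", claim: "null deref", confidenceScore: 90 },
    { file: "src/b.js", lines: "3", claim: "off by one", confidenceScore: 40 },
  ]
);
console.log(merged.map((f) => `${f.file}:${f.confidenceScore}`));
```

Keeping the maximum score means a duplicate can move a finding out of the low-confidence bucket, which is why the state test expectation had to change.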
|
|
46
|
+
|
|
47
|
+
## Decision Log
|
|
48
|
+
|
|
49
|
+
- Decision: use provider-agnostic local JSON schemas as the source of truth, and treat provider-native structured outputs as an optimization layer.
|
|
50
|
+
Rationale: Bug Hunter runs across multiple agent backends and CLIs. Native structured outputs from Claude, OpenAI, and Gemini can improve reliability where available, but the skill must remain correct on backends that only support plain prompting and local validation.
|
|
51
|
+
Date/Author: 2026-03-11 / Codex
|
|
52
|
+
|
|
53
|
+
- Decision: keep Markdown reports, but generate them from validated JSON artifacts.
|
|
54
|
+
Rationale: humans still need readable reports, but machine-state should not depend on brittle line parsing or prompt formatting quirks.
|
|
55
|
+
Date/Author: 2026-03-11 / Codex
|
|
56
|
+
|
|
57
|
+
- Decision: normalize confidence to both `confidenceScore` and `confidenceLabel`.
|
|
58
|
+
Rationale: numeric confidence is required for fix eligibility and consistency checks, while a short label remains useful for readable reports.
|
|
59
|
+
Date/Author: 2026-03-11 / Codex
|
|
60
|
+
|
|
61
|
+
- Decision: migrate loop state from `coverage.md` to `coverage.json` and keep a rendered `coverage.md` for visibility.
|
|
62
|
+
Rationale: the loop is the long-lived state carrier. It benefits the most from strict schema validation, resumability, and safe retries.
|
|
63
|
+
Date/Author: 2026-03-11 / Codex
|
|
64
|
+
|
|
65
|
+
- Decision: ship the schema files as package assets and treat missing schema files as a preflight failure.
|
|
66
|
+
Rationale: payload guards and worker validation now depend on the checked-in schema files at runtime, so an install missing `schemas/` is broken even if the scripts themselves exist.
|
|
67
|
+
Date/Author: 2026-03-11 / Codex
|
|
68
|
+
|
|
69
|
+
## Outcomes & Retrospective
|
|
70
|
+
|
|
71
|
+
This migration milestone is now complete. Bug Hunter rejects malformed `findings.json` artifacts before they reach state and retries the worker when those artifacts are invalid. It ships explicit schemas plus a validator CLI, renders Markdown from canonical JSON, and writes canonical `coverage.json` loop state with a derived `coverage.md` companion. It also enforces Skeptic/Referee/Fixer artifact validation through the orchestrated `run-bug-hunter.cjs phase` path as well as the manual/local path.
|
|
72
|
+
|
|
73
|
+
## Context and Orientation
|
|
74
|
+
|
|
75
|
+
Bug Hunter is a skill package rooted at `/Users/codex/.agents/skills/bug-hunter`. The important files for this work are spread across prompts, mode documents, helper scripts, and tests.
|
|
76
|
+
|
|
77
|
+
`prompts/hunter.md`, `prompts/skeptic.md`, `prompts/referee.md`, and `prompts/fixer.md` define what each analysis phase writes today. They currently emphasize Markdown output with free-form sections and line-oriented formats. This is the main place where drift enters the system.
|
|
78
|
+
|
|
79
|
+
`scripts/run-bug-hunter.cjs` is the orchestration helper that manages chunk execution, retries, delta expansion, consistency reports, and fix-plan generation. It already understands JSON findings files written by workers. This file is the best anchor for the migration because it already behaves like a JSON pipeline in the tests.
|
|
80
|
+
|
|
81
|
+
`scripts/bug-hunter-state.cjs` stores durable scan state such as chunk progress, a bug ledger, fact cards, consistency information, and fix plans. It currently records findings from JSON files, but it does not validate rich schemas and it accepts incomplete objects as long as basic fields exist.
|
|
82
|
+
|
|
83
|
+
`scripts/payload-guard.cjs` validates worker payloads before launch. Right now it only checks that required top-level fields exist and that `outputSchema` is “an object”. It does not enforce real schemas for either inbound or outbound data.
|
|
84
|
+
|
|
85
|
+
`modes/loop.md` and `modes/fix-loop.md` define the iterative audit loop. They currently store machine state in `.bug-hunter/coverage.md`, which is a Markdown file with line-based sections. That format is readable but brittle and expensive to maintain compared with JSON.
|
|
86
|
+
|
|
87
|
+
`evals/evals.json` and `scripts/tests/*.test.cjs` are the safety net. They currently prove parts of the JSON worker path, but they do not yet enforce full end-to-end structured outputs or the newly required full-queue loop semantics.
|
|
88
|
+
|
|
89
|
+
In this plan, “structured output” means a phase result that conforms to a versioned JSON schema that can be validated locally with no guesswork. “Canonical artifact” means the file every later phase trusts as the source of truth. “Rendered report” means a human-readable Markdown file generated from a validated JSON artifact.
|
|
90
|
+
|
|
91
|
+
## Plan of Work
|
|
92
|
+
|
|
93
|
+
The work starts by defining stable versioned schemas in a new directory, `schemas/`, under the skill root. Create one schema module per artifact: `recon`, `findings`, `skeptic`, `referee`, `coverage`, `fix-report`, and any shared types such as file coverage entries, cross-reference items, STRIDE/CWE metadata, and confidence values. Use plain JSON Schema stored in `.json` files or JavaScript schema builders that output JSON Schema, but keep the final schemas serializable and versioned. Each schema must include a `schemaVersion` field. Confidence must be represented as `confidenceScore` on a numeric 0–100 scale, and optionally `confidenceLabel` derived from it for rendered reports.
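
As an illustration of deriving `confidenceLabel` from a 0–100 `confidenceScore`, here is a minimal sketch. The plan does not state where the label boundaries sit, so the thresholds (75 and 40) and the label strings are assumptions chosen only for the example.

```javascript
// Hypothetical thresholds: the plan only says confidenceLabel is derived
// from a numeric 0-100 confidenceScore, not where the boundaries are.
function confidenceLabel(score) {
  if (typeof score !== "number" || Number.isNaN(score) || score < 0 || score > 100) {
    throw new RangeError(`confidenceScore must be a number in [0, 100], got ${score}`);
  }
  if (score >= 75) return "High";
  if (score >= 40) return "Medium";
  return "Low";
}

console.log(confidenceLabel(82)); // High
console.log(confidenceLabel(10)); // Low
```

Deriving the label in one place keeps rendered reports consistent with the numeric value that fix eligibility relies on.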
|
|
94
|
+
|
|
95
|
+
Next, add a schema runtime helper under `scripts/`, for example `scripts/schema-validate.cjs`, that can validate any named artifact file and print a short machine-readable result. This helper must be used in three places: when generating payloads, when reading worker outputs, and when reading persisted loop state. Expand `scripts/payload-guard.cjs` so the role templates point to real output schemas rather than placeholder `format/version` objects. The guard should reject missing or mismatched schema names before work starts.
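
A validator of this kind could be sketched as below. This is a presence-only check under stated assumptions, not the shipped `scripts/schema-runtime.cjs`, and it does not implement full JSON Schema validation. The required fields are the ones this plan lists for Hunter findings; the function name `validateFindings` and the `findings[<index>]:` error prefix are hypothetical.

```javascript
// Hypothetical sketch: check that each finding carries the fields this plan
// lists as required. A real validator would enforce types and enums as well.
const REQUIRED_FINDING_FIELDS = [
  "bugId", "severity", "category", "file", "lines",
  "claim", "evidence", "runtimeTrigger", "crossReferences", "confidenceScore",
];

function validateFindings(findings) {
  if (!Array.isArray(findings)) {
    return { ok: false, artifact: "findings", errors: ["artifact must be an array"] };
  }
  const errors = [];
  findings.forEach((finding, index) => {
    const isObject = finding !== null && typeof finding === "object";
    for (const field of REQUIRED_FINDING_FIELDS) {
      if (!isObject || !(field in finding)) {
        errors.push(`findings[${index}]: missing required property: ${field}`);
      }
    }
  });
  return errors.length === 0
    ? { ok: true, artifact: "findings" }
    : { ok: false, artifact: "findings", errors };
}

// Illustrative data only.
const valid = {
  bugId: "BUG-1", severity: "Medium", category: "logic", file: "src/a.js",
  lines: "10-12", claim: "null deref", evidence: "value is read before the guard",
  runtimeTrigger: "empty input", crossReferences: [], confidenceScore: 80,
};
const { claim, ...withoutClaim } = valid;
console.log(JSON.stringify(validateFindings([valid])));
console.log(JSON.stringify(validateFindings([withoutClaim])));
```

Returning a plain `{ ok, artifact, errors }` object lets the same function serve payload generation, worker-output reads, and loop-state reads without each caller parsing text.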
|
|
96
|
+
|
|
97
|
+
Then migrate the prompts. `prompts/hunter.md`, `prompts/skeptic.md`, `prompts/referee.md`, and `prompts/fixer.md` should stop treating Markdown as the primary output. Instead they should instruct the agent to write a JSON array or object to the assigned canonical path, and optionally write a rendered Markdown companion file if the assignment requests it. The JSON contract must be concrete. For example, Hunter findings must include `bugId`, `severity`, `category`, `file`, `lines`, `claim`, `evidence`, `runtimeTrigger`, `crossReferences`, and `confidenceScore`. Referee verdicts must include `verdict`, `trueSeverity`, `confidenceScore`, `confidenceLabel`, `verificationMode`, and enriched security fields where applicable. Keep the prose reasoning, but move it into explicitly typed fields such as `analysisSummary` instead of free-form blocks.
|
|
98
|
+
|
|
99
|
+
Once the prompts are changed, update the orchestrator and state layer to consume the new contracts only. In `scripts/run-bug-hunter.cjs`, treat missing worker JSON output as a hard phase failure unless the phase explicitly allows zero results via a valid empty array. Validate every worker output before recording it in state. If validation fails, journal the schema error, mark the chunk or phase as failed, and let the retry logic rerun the worker. In `scripts/bug-hunter-state.cjs`, reject findings entries that omit required fields, and enrich ledger entries with normalized keys such as `confidenceScore`, `severity`, `category`, and `verificationMode`. Do not silently continue when a result is malformed.
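
The validate-before-record rule with retry can be sketched as follows. This is a simplified synchronous illustration, not the logic in `scripts/run-bug-hunter.cjs`; `runChunkWithRetry`, its parameter names, and the default of three attempts are assumptions.

```javascript
// Hypothetical sketch: run a worker, validate its output, and record only
// validated results. Invalid output is journaled and the worker is retried.
function runChunkWithRetry({ runWorker, validate, record, journal, maxAttempts = 3 }) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const output = runWorker(attempt);
    const result = validate(output);
    if (result.ok) {
      record(output); // state is touched only after validation succeeds
      return { ok: true, attempts: attempt };
    }
    journal.push({ attempt, errors: result.errors });
  }
  return { ok: false, attempts: maxAttempts };
}

// Illustrative run: the first attempt omits `claim`, the second is valid.
const journal = [];
const state = [];
const outcome = runChunkWithRetry({
  runWorker: (attempt) =>
    attempt === 1 ? [{ bugId: "BUG-1" }] : [{ bugId: "BUG-1", claim: "null deref" }],
  validate: (output) =>
    output.every((finding) => "claim" in finding)
      ? { ok: true }
      : { ok: false, errors: ["missing required property: claim"] },
  record: (output) => state.push(...output),
  journal,
});
console.log(outcome, journal.length, state.length);
```

Because `record` runs only on the success branch, a malformed attempt leaves state untouched, which matches the "do not silently continue" requirement.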
|
|
100
|
+
|
|
101
|
+
After the phase artifacts are stable, migrate loop state. Add a new canonical file, `.bug-hunter/coverage.json`, and make it the state the loop reads and writes. It should contain top-level metadata, file coverage entries, cumulative bugs, fix ledger entries, and the current loop status. Keep `.bug-hunter/coverage.md`, but generate it from `coverage.json` after each iteration so humans can still inspect progress. Update `modes/loop.md` and `modes/fix-loop.md` to describe the JSON state as canonical and Markdown as derived.
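
Generating `coverage.md` from `coverage.json` could be as small as the sketch below. The top-level `iteration`, `status`, and `files` names come from this plan; the per-file `path` and `status` fields and the function name `renderCoverageMarkdown` are hypothetical.

```javascript
// Hypothetical sketch: derive a human-readable coverage summary from the
// canonical JSON state. Markdown is output only and is never parsed back.
function renderCoverageMarkdown(coverage) {
  const lines = [
    "# Coverage",
    "",
    `- Iteration: ${coverage.iteration}`,
    `- Status: ${coverage.status}`,
    "",
    "| File | Status |",
    "| --- | --- |",
    ...coverage.files.map((entry) => `| ${entry.path} | ${entry.status} |`),
  ];
  return lines.join("\n") + "\n";
}

// Illustrative state only.
const coverageMarkdown = renderCoverageMarkdown({
  schemaVersion: 1,
  iteration: 2,
  status: "IN_PROGRESS",
  files: [
    { path: "src/a.js", status: "done" },
    { path: "src/b.js", status: "queued" },
  ],
  bugs: [],
  fixes: [],
});
console.log(coverageMarkdown);
```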
|
|
102
|
+
|
|
103
|
+
The provider-specific structured-output layer comes next. Add a small capability adapter under `scripts/` or `templates/` that can describe three modes: native structured output supported, native unsupported but JSON prompting available, and plain-text fallback with local validation. Do not make provider-native structured outputs mandatory for correctness. When the backend supports them, use the local schema definitions to generate provider-specific requests. For Claude this means schema-constrained output or strict tool result patterns where available. For OpenAI this means strict structured outputs using JSON Schema and handling refusals or first-schema latency explicitly. For Gemini this means `responseMimeType: application/json` with `responseSchema`. If a backend does not support native structured output, keep the prompt JSON-first and validate locally after the response.
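
The three-mode capability adapter described above might be sketched like this. The mode identifiers and the capability flag names (`nativeStructuredOutput`, `jsonPrompting`) are hypothetical; no provider API is called here.

```javascript
// Hypothetical sketch: pick an output mode from declared backend capabilities.
// Local schema validation is assumed to run afterwards in every mode.
const OUTPUT_MODES = Object.freeze({
  NATIVE: "native-structured",
  JSON_PROMPT: "json-prompt",
  PLAIN_TEXT: "plain-text-local-validation",
});

function selectOutputMode(capabilities = {}) {
  if (capabilities.nativeStructuredOutput) return OUTPUT_MODES.NATIVE;
  if (capabilities.jsonPrompting) return OUTPUT_MODES.JSON_PROMPT;
  return OUTPUT_MODES.PLAIN_TEXT;
}

console.log(selectOutputMode({ nativeStructuredOutput: true })); // native-structured
console.log(selectOutputMode({ jsonPrompting: true }));          // json-prompt
console.log(selectOutputMode());                                 // plain-text-local-validation
```

Defaulting to the plain-text mode keeps correctness independent of any provider feature, in line with the decision to treat native structured outputs as an optimization layer.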
|
|
104
|
+
|
|
105
|
+
Finally, update every test and eval path. Add tests for schema validation failures, malformed worker outputs, missing `confidenceScore`, invalid coverage state, and rendered Markdown generation from JSON. Update `evals/evals.json` to require full queued coverage semantics and the presence of canonical JSON artifacts. Keep the existing worker fixture tests, but add one fully integrated smoke path that simulates a Hunter JSON output, a Skeptic JSON output, a Referee JSON output, and the resulting fix-plan eligibility.
|
|
106
|
+
|
|
107
|
+
## Milestones
|
|
108
|
+
|
|
109
|
+
### Milestone 1: Define the canonical data contracts
|
|
110
|
+
|
|
111
|
+
At the end of this milestone, the repository has explicit versioned schemas for every phase artifact, and a local validator can reject malformed files deterministically. Nothing user-visible changes yet, but the implementation gains a stable foundation. This milestone is complete when a novice can run schema validation against a sample `findings.json` and see success, then remove a required field and see a validation failure with a helpful error.
|
|
112
|
+
|
|
113
|
+
### Milestone 2: Convert prompts and orchestrator to JSON-first phase outputs
|
|
114
|
+
|
|
115
|
+
At the end of this milestone, Hunter, Skeptic, Referee, and Fixer all emit canonical JSON artifacts, and the orchestrator only accepts validated JSON for state updates. Markdown reports still exist, but they are generated from JSON. This milestone is complete when a simulated worker run produces `findings.json`, the orchestrator records it, and a malformed output fails fast with retry instead of silently succeeding.
|
|
116
|
+
|
|
117
|
+
### Milestone 3: Migrate loop state to JSON and align semantics
|
|
118
|
+
|
|
119
|
+
At the end of this milestone, `.bug-hunter/coverage.json` is the canonical loop state, the loop uses full queued coverage semantics, and `.bug-hunter/coverage.md` is a rendered summary. This milestone is complete when a loop simulation can resume from `coverage.json`, continue through queued files, and render a readable Markdown view from the same state.
|
|
120
|
+
|
|
121
|
+
### Milestone 4: Add provider-native structured output adapters and end-to-end safety tests
|
|
122
|
+
|
|
123
|
+
At the end of this milestone, the skill can optionally use native structured outputs on Claude-, OpenAI-, or Gemini-capable backends, but still behaves correctly without them. The tests and evals enforce the new contracts. This milestone is complete when the provider adapter selects the correct mode, malformed outputs are rejected across all supported execution paths, and evals no longer encode the obsolete `CRITICAL/HIGH` stopping rule.
|
|
124
|
+
|
|
125
|
+
## Concrete Steps
|
|
126
|
+
|
|
127
|
+
Work from `/Users/codex/.agents/skills/bug-hunter`.
|
|
128
|
+
|
|
129
|
+
1. Create the schema directory and files.
|
|
130
|
+
|
|
131
|
+
mkdir -p docs/plans schemas
|
|
132
|
+
|
|
133
|
+
Add files such as:
|
|
134
|
+
schemas/findings.schema.json
|
|
135
|
+
schemas/skeptic.schema.json
|
|
136
|
+
schemas/referee.schema.json
|
|
137
|
+
schemas/coverage.schema.json
|
|
138
|
+
schemas/fix-report.schema.json
|
|
139
|
+
schemas/recon.schema.json
|
|
140
|
+
schemas/shared.schema.json
|
|
141
|
+
|
|
142
|
+
Expected result: the `schemas/` directory exists and each schema file includes `schemaVersion`.
|
|
143
|
+
|
|
144
|
+
2. Add a validation helper.
|
|
145
|
+
|
|
146
|
+
Create `scripts/schema-validate.cjs` and teach it:
|
|
147
|
+
- how to load a schema by name
|
|
148
|
+
- how to validate a file path
|
|
149
|
+
- how to print JSON success or JSON error output
|
|
150
|
+
|
|
151
|
+
Expected result:
|
|
152
|
+
|
|
153
|
+
node scripts/schema-validate.cjs findings schemas/examples/findings.valid.json
|
|
154
|
+
{"ok":true,"artifact":"findings"}
|
|
155
|
+
|
|
156
|
+
node scripts/schema-validate.cjs findings schemas/examples/findings.invalid.json
|
|
157
|
+
{"ok":false,"artifact":"findings","errors":["missing required property: claim"]}
|
|
158
|
+
|
|
159
|
+
3. Update `scripts/payload-guard.cjs` and `scripts/run-bug-hunter.cjs`.
|
|
160
|
+
|
|
161
|
+
Replace placeholder `outputSchema` objects with real schema references. Validate worker outputs before calling `record-findings` or any equivalent state write.
|
|
162
|
+
|
|
163
|
+
Expected result: a malformed findings file causes the chunk to fail with a schema error instead of being recorded as partial success.
|
|
164
|
+
|
|
165
|
+
4. Update the prompts and rendered-report flow.
|
|
166
|
+
|
|
167
|
+
Change prompt files so JSON is the primary output. Add a renderer script such as `scripts/render-report.cjs` if needed.
|
|
168
|
+
|
|
169
|
+
Expected result: a run produces both JSON and Markdown, with Markdown fully derivable from JSON.
|
|
170
|
+
|
|
171
|
+
5. Migrate loop state.
|
|
172
|
+
|
|
173
|
+
Add `coverage.json`, update `modes/loop.md` and `modes/fix-loop.md`, and render `coverage.md` from JSON.
|
|
174
|
+
|
|
175
|
+
Expected result: the loop resumes from JSON state and no longer depends on parsing Markdown line structure.
|
|
176
|
+
|
|
177
|
+
6. Update tests and evals.
|
|
178
|
+
|
|
179
|
+
Run:
|
|
180
|
+
|
|
181
|
+
node --test scripts/tests/*.test.cjs
|
|
182
|
+
|
|
183
|
+
Add tests for malformed artifacts, missing confidence scores, bad coverage state, and rendered Markdown output. Update `evals/evals.json` so loop completion requires full queued coverage, not just CRITICAL and HIGH completion.
|
|
184
|
+
|
|
185
|
+
## Validation and Acceptance
|
|
186
|
+
|
|
187
|
+
Acceptance is behavior-based.
|
|
188
|
+
|
|
189
|
+
First, run the script tests from `/Users/codex/.agents/skills/bug-hunter`:
|
|
190
|
+
|
|
191
|
+
node --test scripts/tests/*.test.cjs
|
|
192
|
+
|
|
193
|
+
Expect all tests to pass, including new tests that would have failed before the migration because the old code accepted malformed outputs or textual confidence.
|
|
194
|
+
|
|
195
|
+
Second, run a local orchestrator smoke path with a valid worker fixture. It must produce canonical JSON output files and a rendered Markdown report. Observe:
|
|
196
|
+
|
|
197
|
+
.bug-hunter/findings.json
|
|
198
|
+
.bug-hunter/referee.json
|
|
199
|
+
.bug-hunter/fix-report.json
|
|
200
|
+
.bug-hunter/coverage.json
|
|
201
|
+
.bug-hunter/report.md
|
|
202
|
+
|
|
203
|
+
Third, deliberately break one phase artifact by removing a required field such as `claim` or `confidenceScore`. Re-run the same smoke path and expect:
|
|
204
|
+
|
|
205
|
+
- the phase fails
|
|
206
|
+
- the journal records a schema validation error
|
|
207
|
+
- state is not updated from the malformed artifact
|
|
208
|
+
- retry logic is allowed to rerun the worker
|
|
209
|
+
|
|
210
|
+
Fourth, run a loop simulation and verify that completion only occurs when every queued scannable file is marked done in `coverage.json`, not merely when CRITICAL and HIGH files are done.
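
The full-queue completion rule can be expressed as a small predicate. This is a sketch under assumptions: the per-file fields `risk`, `status`, and `scannable`, and the `"done"` status value, are hypothetical names, since the coverage entry shape is not spelled out here.

```javascript
// Hypothetical sketch: the loop is complete only when every queued scannable
// file is done, regardless of its risk tier.
function isLoopComplete(coverage) {
  const scannable = coverage.files.filter((entry) => entry.scannable !== false);
  return scannable.every((entry) => entry.status === "done");
}

// Illustrative state: CRITICAL and HIGH files are done, a LOW file is not.
const partial = {
  files: [
    { path: "src/auth.js", risk: "CRITICAL", status: "done" },
    { path: "src/api.js", risk: "HIGH", status: "done" },
    { path: "src/util.js", risk: "LOW", status: "queued" },
  ],
};
console.log(isLoopComplete(partial)); // false
```

Under the earlier `CRITICAL/HIGH` stopping rule this state would have counted as finished; the predicate above deliberately ignores `risk`.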
|
|
211
|
+
|
|
212
|
+
## Idempotence and Recovery
|
|
213
|
+
|
|
214
|
+
The migration should be safe to run incrementally. Schema files and validators are additive. During implementation, keep Markdown outputs in parallel with JSON outputs until all consumers are switched over. Do not remove Markdown files until JSON-based rendering and validation are proven.
|
|
215
|
+
|
|
216
|
+
If a phase fails because of schema validation, the safe recovery path is to fix the producer prompt or fixture and rerun the same command. Because the state update happens after validation, malformed outputs should not poison the state file.
|
|
217
|
+
|
|
218
|
+
When migrating loop state, keep a one-time importer from `coverage.md` to `coverage.json` or, if that is too brittle, explicitly start fresh and document that old Markdown loop state is not resumable across the migration. Choose one path and document it in the implementation notes.
|
|
219
|
+
|
|
220
|
+
## Artifacts and Notes
|
|
221
|
+
|
|
222
|
+
The most important implementation artifacts should be:
|
|
223
|
+
|
|
224
|
+
schemas/*.schema.json
|
|
225
|
+
scripts/schema-validate.cjs
|
|
226
|
+
scripts/render-report.cjs
|
|
227
|
+
.bug-hunter/*.json
|
|
228
|
+
.bug-hunter/report.md
|
|
229
|
+
.bug-hunter/coverage.md
|
|
230
|
+
|
|
231
|
+
Expected evidence after completion:
|
|
232
|
+
|
|
233
|
+
$ node scripts/schema-validate.cjs findings .bug-hunter/findings.json
|
|
234
|
+
{"ok":true,"artifact":"findings"}
|
|
235
|
+
|
|
236
|
+
$ node --test scripts/tests/*.test.cjs
|
|
237
|
+
ℹ pass <updated-count>
|
|
238
|
+
ℹ fail 0
|
|
239
|
+
|
|
240
|
+
## Interfaces and Dependencies
|
|
241
|
+
|
|
242
|
+
Define these stable interfaces by the end of the work:
|
|
243
|
+
|
|
244
|
+
In `schemas/findings.schema.json`, define a findings artifact that is an array of finding objects. Each finding object must include:
|
|
245
|
+
|
|
246
|
+
bugId: string
|
|
247
|
+
severity: "Critical" | "Medium" | "Low"
|
|
248
|
+
category: string
|
|
249
|
+
file: string
|
|
250
|
+
lines: string
|
|
251
|
+
claim: string
|
|
252
|
+
evidence: string
|
|
253
|
+
runtimeTrigger: string
|
|
254
|
+
crossReferences: array
|
|
255
|
+
confidenceScore: number
|
|
256
|
+
|
|
257
|
+
In `schemas/referee.schema.json`, define a verdict artifact with:
|
|
258
|
+
|
|
259
|
+
bugId: string
|
|
260
|
+
verdict: "REAL_BUG" | "NOT_A_BUG" | "MANUAL_REVIEW"
|
|
261
|
+
trueSeverity: "Critical" | "Medium" | "Low"
|
|
262
|
+
confidenceScore: number
|
|
263
|
+
confidenceLabel: string
|
|
264
|
+
verificationMode: "INDEPENDENTLY_VERIFIED" | "EVIDENCE_BASED"
|
|
265
|
+
analysisSummary: string
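
With numeric confidence in the verdict contract, fix eligibility reduces to a comparison against a threshold. The sketch below is illustrative only: restricting eligibility to `REAL_BUG` verdicts and the threshold value of 75 are assumptions, and `selectFixEligible` is a hypothetical name.

```javascript
// Hypothetical sketch: select verdicts eligible for automated fixing.
// Assumes only REAL_BUG verdicts qualify; the threshold is caller-supplied.
function selectFixEligible(verdicts, confidenceThreshold) {
  return verdicts.filter(
    (verdict) =>
      verdict.verdict === "REAL_BUG" &&
      typeof verdict.confidenceScore === "number" &&
      verdict.confidenceScore >= confidenceThreshold
  );
}

// Illustrative verdicts only.
const eligible = selectFixEligible(
  [
    { bugId: "BUG-1", verdict: "REAL_BUG", confidenceScore: 88 },
    { bugId: "BUG-2", verdict: "REAL_BUG", confidenceScore: 60 },
    { bugId: "BUG-3", verdict: "NOT_A_BUG", confidenceScore: 95 },
    { bugId: "BUG-4", verdict: "REAL_BUG", confidenceScore: "High" },
  ],
  75
);
console.log(eligible.map((verdict) => verdict.bugId));
```

The explicit `typeof` check shows why textual confidence is a problem: a `"High"` string never satisfies a numeric comparison, so such a verdict would silently drop out of the fix plan.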
|
|
266
|
+
|
|
267
|
+
In `schemas/coverage.schema.json`, define loop state with:
|
|
268
|
+
|
|
269
|
+
schemaVersion: number
|
|
270
|
+
iteration: number
|
|
271
|
+
status: "IN_PROGRESS" | "COMPLETE"
|
|
272
|
+
files: array of file coverage entries
|
|
273
|
+
bugs: array of confirmed bug summaries
|
|
274
|
+
fixes: array of fix ledger entries
|
|
275
|
+
|
|
276
|
+
In `scripts/schema-validate.cjs`, implement a CLI with:
|
|
277
|
+
|
|
278
|
+
node scripts/schema-validate.cjs <artifact-name> <file-path>
|
|
279
|
+
|
|
280
|
+
In `scripts/render-report.cjs`, implement a CLI that renders Markdown from JSON artifacts:
|
|
281
|
+
|
|
282
|
+
node scripts/render-report.cjs report .bug-hunter/findings.json .bug-hunter/referee.json > .bug-hunter/report.md
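
A renderer behind that CLI might join findings to verdicts by `bugId`, roughly as sketched here. This is not the shipped `scripts/render-report.cjs`; the report layout, the `renderReport` name, and the `UNREVIEWED` placeholder for findings without a verdict are assumptions.

```javascript
// Hypothetical sketch: render a Markdown report from findings and referee
// verdicts that have already passed schema validation.
function renderReport(findings, verdicts) {
  const verdictById = new Map(verdicts.map((verdict) => [verdict.bugId, verdict]));
  const lines = ["# Bug Hunter Report", ""];
  for (const finding of findings) {
    const verdict = verdictById.get(finding.bugId);
    lines.push(`## ${finding.bugId}: ${finding.claim}`, "");
    lines.push(`- File: ${finding.file}:${finding.lines}`);
    lines.push(`- Severity: ${verdict ? verdict.trueSeverity : finding.severity}`);
    lines.push(`- Verdict: ${verdict ? verdict.verdict : "UNREVIEWED"}`, "");
  }
  return lines.join("\n");
}

// Illustrative artifacts only.
const report = renderReport(
  [
    { bugId: "BUG-1", claim: "null deref", file: "src/a.js", lines: "10-12", severity: "Medium" },
    { bugId: "BUG-2", claim: "off by one", file: "src/b.js", lines: "3", severity: "Low" },
  ],
  [{ bugId: "BUG-1", verdict: "REAL_BUG", trueSeverity: "Critical" }]
);
console.log(report);
```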
|
|
283
|
+
|
|
284
|
+
Provider-native structured output adapters, if added, must consume these local schemas rather than inventing provider-specific contracts.
|
|
285
|
+
|
|
286
|
+
## Change Log For This Plan
|
|
287
|
+
|
|
288
|
+
2026-03-11: Initial ExecPlan created after the structured-output audit. The plan chooses provider-agnostic local schemas as the foundation and treats Claude/OpenAI/Gemini native structured outputs as optional accelerators rather than the source of truth.
|
|
@@ -0,0 +1,193 @@
|
|
|
1
|
+
# Surgical Fix Plan for Confirmed Audit Bugs
|
|
2
|
+
|
|
3
|
+
## Objective
|
|
4
|
+
|
|
5
|
+
Fix the four confirmed runtime bugs without changing the surrounding product behavior, public UX, or broader pipeline design beyond what is necessary for correctness and safety.
|
|
6
|
+
|
|
7
|
+
Confirmed bugs:
|
|
8
|
+
- `BUG-1` — `scripts/run-bug-hunter.cjs`
|
|
9
|
+
- `BUG-2` — `scripts/pr-scope.cjs`
|
|
10
|
+
- `BUG-3` — `scripts/fix-lock.cjs`
|
|
11
|
+
- `BUG-4` — `scripts/code-index.cjs`
|
|
12
|
+
|
|
13
|
+
## Fix order
|
|
14
|
+
|
|
15
|
+
1. `BUG-3` `scripts/fix-lock.cjs`
|
|
16
|
+
2. `BUG-4` `scripts/code-index.cjs`
|
|
17
|
+
3. `BUG-2` `scripts/pr-scope.cjs`
|
|
18
|
+
4. `BUG-1` `scripts/run-bug-hunter.cjs`
|
|
19
|
+
|
|
20
|
+
Rationale:
|
|
21
|
+
- `BUG-3` and `BUG-4` are isolated utility-level correctness fixes with low blast radius.
|
|
22
|
+
- `BUG-2` changes PR scope resolution behavior and needs targeted tests around fallback semantics.
|
|
23
|
+
- `BUG-1` touches orchestration behavior and should land last after the supporting utilities are stable.
|
|
24
|
+
|
|
25
|
+
---
|
|
26
|
+
|
|
27
|
+
## BUG-3 — fix-lock can steal a live lock
|
|
28
|
+
|
|
29
|
+
### Problem
|
|
30
|
+
`acquire()` treats TTL expiry as sufficient proof of staleness and does not check whether the recorded PID is still alive.
|
|
31
|
+
|
|
32
|
+
### Surgical fix
|
|
33
|
+
- Keep the existing lock file format.
|
|
34
|
+
- Change stale recovery logic so a lock is auto-recovered only when:
|
|
35
|
+
- TTL expired **and**
|
|
36
|
+
- owner PID is absent or not alive.
|
|
37
|
+
- If TTL expired but owner is still alive, return a failure payload such as:
|
|
38
|
+
- `reason: "lock-held-by-live-owner"`
|
|
39
|
+
- include `stale: true` and `ownerAlive: true` for observability.
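
The recovery rule above can be sketched as follows. It is a minimal illustration, not the code in `scripts/fix-lock.cjs`: the lock field names (`pid`, `createdAtMs`, `ttlMs`), the `lock-held` reason for a fresh lock, and the function names are hypothetical. The liveness probe uses Node's `process.kill(pid, 0)`, which tests for process existence without delivering a signal.

```javascript
// Hypothetical sketch: a lock is recoverable only when its TTL has expired
// AND its owner PID is absent or not alive.
function isProcessAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0); // signal 0: existence check only
    return true;
  } catch (error) {
    // EPERM means the process exists but we lack permission to signal it.
    return error.code === "EPERM";
  }
}

function evaluateLock(lock, nowMs, isAlive = isProcessAlive) {
  const stale = nowMs - lock.createdAtMs > lock.ttlMs;
  if (!stale) {
    return { ok: false, reason: "lock-held", stale: false };
  }
  if (isAlive(lock.pid)) {
    return { ok: false, reason: "lock-held-by-live-owner", stale: true, ownerAlive: true };
  }
  return { ok: true, recovered: true };
}

// Illustrative: an expired lock owned by this (live) process is not stolen.
console.log(evaluateLock({ pid: process.pid, createdAtMs: 0, ttlMs: 1000 }, 5000));
```

Passing `isAlive` as a parameter makes the dead-owner and live-owner cases easy to test without spawning processes.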
|
|
40
|
+
|
|
41
|
+
### Files
|
|
42
|
+
- `scripts/fix-lock.cjs`
|
|
43
|
+
- tests in `scripts/tests/fix-lock.test.cjs`
|
|
44
|
+
|
|
45
|
+
### Test additions
|
|
46
|
+
- acquiring a fresh lock from another process still fails
|
|
47
|
+
- acquiring an expired lock whose PID is dead succeeds
|
|
48
|
+
- acquiring an expired lock whose PID is alive fails
|
|
49
|
+
- `status` remains consistent with acquire behavior
|
|
50
|
+
|
|
51
|
+
### Risk
|
|
52
|
+
Low. Pure locking behavior change.
|
|
53
|
+
|
|
54
|
+
---
|
|
55
|
+
|
|
56
|
+
## BUG-4 — code-index query-bugs temp file collision
|
|
57
|
+
|
|
58
|
+
### Problem
|
|
59
|
+
`queryBugs()` always writes `.seed-files.tmp.json` in the same directory and only deletes it on success.
|
|
60
|
+
|
|
61
|
+
### Surgical fix
|
|
62
|
+
- Replace fixed temp filename with a unique invocation-scoped filename, e.g. based on:
|
|
63
|
+
  - `process.pid`
|
|
64
|
+
  - timestamp
|
|
65
|
+
  - random suffix
|
|
66
|
+
- Wrap temp-file lifecycle in `try/finally` so cleanup runs even if `query()` throws.
|
|
67
|
+
- Preserve current command contract and output shape.
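
One way to combine a unique filename with guaranteed cleanup is sketched below. The helper names `seedTempPath` and `withSeedFile` are hypothetical, and the real `queryBugs()` internals are not shown in this plan, so treat this as an illustration of the pattern only.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: builds an invocation-scoped temp path from the
// PID, a timestamp, and a random suffix so parallel runs cannot collide.
function seedTempPath(dir) {
  const suffix = crypto.randomBytes(4).toString('hex');
  return path.join(dir, `.seed-files.${process.pid}.${Date.now()}.${suffix}.tmp.json`);
}

// Writes the seed list, runs a synchronous callback, and always cleans up.
function withSeedFile(dir, seedFiles, run) {
  const tempPath = seedTempPath(dir);
  fs.writeFileSync(tempPath, JSON.stringify(seedFiles));
  try {
    return run(tempPath);
  } finally {
    // force: true avoids a second error if the file is already gone.
    fs.rmSync(tempPath, { force: true });
  }
}
```

The sketch assumes `run` is synchronous. If the query step returns a promise, the wrapper would need to be `async` and `await` the call inside the `try` block so cleanup waits for it to settle.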
|
|
68
|
+
|
|
69
|
+
### Files
|
|
70
|
+
- `scripts/code-index.cjs`
|
|
71
|
+
- tests in `scripts/tests/code-index.test.cjs`
|
|
72
|
+
|
|
73
|
+
### Test additions
|
|
74
|
+
- `query-bugs` cleans up temp file after success
|
|
75
|
+
- `query-bugs` cleans up temp file after a thrown query path
|
|
76
|
+
- parallel invocations do not reuse the same temp file name
|
|
77
|
+
|
|
78
|
+
### Risk
|
|
79
|
+
Low. Local helper behavior only.
|
|
80
|
+
|
|
81
|
+
---
|
|
82
|
+
|
|
83
|
+
## BUG-2 — pr-scope silent wrong-base fallback
|
|
84
|
+
|
|
85
|
+
### Problem
|
|
86
|
+
For `selector === "current"`, any `gh` failure falls back to `git diff <base or main>...HEAD` and reports success. This can silently produce the wrong review scope.
|
|
87
|
+
|
|
88
|
+
### Surgical fix
|
|
89
|
+
Preferred minimal behavior:
|
|
90
|
+
- Keep git fallback only for `current`.
|
|
91
|
+
- Before fallback, determine base branch more safely:
|
|
92
|
+
  1. explicit `--base` if supplied
|
|
93
|
+
  2. repo default branch if discoverable
|
|
94
|
+
  3. otherwise fail explicitly instead of assuming `main`
|
|
95
|
+
- If `gh` fails and no trustworthy base is available, return an error rather than a successful but potentially wrong scope.
|
|
96
|
+
|
|
97
|
+
### Implementation notes
|
|
98
|
+
- Add a small helper to resolve default branch via git when possible, e.g. from:
|
|
99
|
+
  - `refs/remotes/origin/HEAD`
|
|
100
|
+
  - or another safe git source
|
|
101
|
+
- Do **not** broaden fallback for numbered/recent PRs.
|
|
102
|
+
- Preserve existing JSON output contract, but add metadata when fallback is used.
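
The base-resolution order and the default-branch helper might look like the sketch below. Function names, the `baseSource` metadata field, and the `"no-trustworthy-base"` reason are illustrative assumptions. The git call uses `git symbolic-ref --quiet --short refs/remotes/origin/HEAD`, which prints a value such as `origin/main` when the remote HEAD ref is set and exits non-zero otherwise.

```javascript
const { execFileSync } = require('node:child_process');

// Hypothetical helper: reads the remote default branch from origin/HEAD.
// Returns null when git is unavailable or the ref is not set.
function discoverDefaultBranch(cwd = process.cwd()) {
  try {
    const ref = execFileSync(
      'git',
      ['symbolic-ref', '--quiet', '--short', 'refs/remotes/origin/HEAD'],
      { cwd, encoding: 'utf8', stdio: ['ignore', 'pipe', 'ignore'] }
    ).trim();
    return ref || null;
  } catch {
    return null;
  }
}

// Resolution order from the plan: explicit --base, then the discovered
// default branch, otherwise an explicit failure (never a silent `main`).
function resolveFallbackBase(explicitBase, discover = discoverDefaultBranch) {
  if (explicitBase) {
    return { ok: true, base: explicitBase, baseSource: 'explicit' };
  }
  const discovered = discover();
  if (discovered) {
    return { ok: true, base: discovered, baseSource: 'default-branch' };
  }
  return { ok: false, reason: 'no-trustworthy-base' };
}
```

Returning `baseSource` alongside `base` is one way to surface fallback metadata without changing the existing fields of the JSON output.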
|
|
103
|
+
|
|
104
|
+
### Files
|
|
105
|
+
- `scripts/pr-scope.cjs`
|
|
106
|
+
- tests in `scripts/tests/pr-scope.test.cjs`
|
|
107
|
+
|
|
108
|
+
### Test additions
|
|
109
|
+
- `current` with explicit `--base` still falls back correctly
|
|
110
|
+
- `current` with discoverable default branch falls back correctly
|
|
111
|
+
- `current` with no trustworthy base fails explicitly
|
|
112
|
+
- `recent` and numbered PRs still require GitHub metadata
|
|
113
|
+
|
|
114
|
+
### Risk
|
|
115
|
+
Medium. Scope-selection behavior changes and could affect user workflows, but the change is correctness-oriented and bounded.
|
|
116
|
+
|
|
117
|
+
---
|
|
118
|
+
|
|
119
|
+
## BUG-1 — fix strategy ignored by executable fix queue
|
|
120
|
+
|
|
121
|
+
### Problem
|
|
122
|
+
`fix-strategy.json` is generated, but `buildFixPlan()` still computes eligibility directly from confidence alone. Strategy classes such as `manual-review`, `larger-refactor`, and `architectural-remediation` do not actually gate execution.
|
|
123
|
+
|
|
124
|
+
### Surgical fix
|
|
125
|
+
- Keep `fix-strategy.json` as the source of truth for execution eligibility.
|
|
126
|
+
- Update the executable queue builder so only findings/clusters marked safe for autofix enter:
|
|
127
|
+
  - `safe-autofix`
|
|
128
|
+
  - and `autofixEligible === true`
|
|
129
|
+
- Ensure `manual-review`, `larger-refactor`, and `architectural-remediation` never flow into canary/rollout.
|
|
130
|
+
- Preserve current `fix-plan.json` shape as much as possible to minimize downstream breakage.
|
|
131
|
+
|
|
132
|
+
### Recommended implementation shape
|
|
133
|
+
Option A, lowest risk:
|
|
134
|
+
- Refactor `buildFixPlan()` to accept preclassified entries from `buildFixStrategy()`.
|
|
135
|
+
- Derive eligible/canary/rollout only from strategy entries where `autofixEligible === true`.
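
A rough sketch of Option A follows. The `autofixEligible` and `executionStage` names come from this plan, but the `strategy` field name, the stage values, and the function name `buildExecutableQueue` are assumptions about the entry shape rather than the actual `fix-strategy.json` contract.

```javascript
// Sketch only: entry shape is assumed, not taken from fix-strategy.schema.json.
// Only entries classified safe-autofix AND flagged eligible reach execution.
function buildExecutableQueue(strategyEntries) {
  const eligible = strategyEntries.filter(
    (entry) => entry.strategy === 'safe-autofix' && entry.autofixEligible === true
  );
  return {
    eligible,
    canary: eligible.filter((entry) => entry.executionStage === 'canary'),
    rollout: eligible.filter((entry) => entry.executionStage === 'rollout'),
  };
}
```

Because confidence is never consulted here, a high-confidence `architectural-remediation` finding is excluded for the same reason as a low-confidence one: its strategy class does not permit autofix.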
|
|
136
|
+
|
|
137
|
+
Also fix cluster-stage ambiguity:
|
|
138
|
+
- Either include `executionStage` in the cluster grouping key, or
|
|
139
|
+
- compute cluster stage conservatively from all entries instead of taking `entries[0]`.
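
Both alternatives for the cluster-stage fix can be sketched briefly. The assumptions here are that `canary` is the more cautious of the two stages, that entries carry a `directory` field used for grouping, and that the helper names are free to choose; none of these are confirmed by the plan.

```javascript
// Assumption: a higher number means a more cautious stage.
const STAGE_CAUTION = { rollout: 0, canary: 1 };

// Alternative 2: derive the stage from every entry instead of entries[0].
// A single canary entry makes the whole cluster canary.
function conservativeClusterStage(entries) {
  let stage = 'rollout';
  for (const entry of entries) {
    if (!(entry.executionStage in STAGE_CAUTION)) {
      throw new Error(`unknown executionStage: ${entry.executionStage}`);
    }
    if (STAGE_CAUTION[entry.executionStage] > STAGE_CAUTION[stage]) {
      stage = entry.executionStage;
    }
  }
  return stage;
}

// Alternative 1: include the stage in the grouping key so mixed-stage
// entries in one directory land in separate clusters.
function clusterKey(entry) {
  return `${entry.directory}::${entry.executionStage}`;
}
```

Throwing on an unknown stage is a deliberate choice in this sketch: silently ignoring it would reintroduce the kind of ambiguity the fix is meant to remove.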
|
|
140
|
+
|
|
141
|
+
### Files
|
|
142
|
+
- `scripts/run-bug-hunter.cjs`
|
|
143
|
+
- tests in `scripts/tests/run-bug-hunter.test.cjs`
|
|
144
|
+
- possibly `schemas/fix-strategy.schema.json` only if contract refinement is needed
|
|
145
|
+
|
|
146
|
+
### Test additions
|
|
147
|
+
- high-confidence `architectural-remediation` finding does not enter `fixPlan.canary/rollout`
|
|
148
|
+
- high-confidence `larger-refactor` finding does not enter executable queue
|
|
149
|
+
- `safe-autofix` findings still enter canary/rollout
|
|
150
|
+
- mixed-stage safe-autofix entries in same directory do not collapse incorrectly
|
|
151
|
+
|
|
152
|
+
### Risk
|
|
153
|
+
Medium-high. This changes executable orchestration, but still within the intended design and existing artifact model.
|
|
154
|
+
|
|
155
|
+
---
|
|
156
|
+
|
|
157
|
+
## Verification plan
|
|
158
|
+
|
|
159
|
+
Run after each bug fix if practical, and again at the end:
|
|
160
|
+
|
|
161
|
+
```bash
|
|
162
|
+
node --test scripts/tests/*.test.cjs
|
|
163
|
+
```
|
|
164
|
+
|
|
165
|
+
Recommended focused sequence during implementation:
|
|
166
|
+
|
|
167
|
+
```bash
|
|
168
|
+
node --test scripts/tests/fix-lock.test.cjs
|
|
169
|
+
node --test scripts/tests/code-index.test.cjs
|
|
170
|
+
node --test scripts/tests/pr-scope.test.cjs
|
|
171
|
+
node --test scripts/tests/run-bug-hunter.test.cjs
|
|
172
|
+
node --test scripts/tests/*.test.cjs
|
|
173
|
+
```
|
|
174
|
+
|
|
175
|
+
## Definition of done
|
|
176
|
+
|
|
177
|
+
- [x] All 4 confirmed bugs have targeted code fixes.
|
|
178
|
+
- [x] Regression tests exist for each bug.
|
|
179
|
+
- [x] Full script test suite passes.
|
|
180
|
+
- [x] No public CLI contract is changed except where necessary to avoid silent wrong behavior.
|
|
181
|
+
- [x] `fix-strategy` becomes behaviorally authoritative for execution gating, not just informational.
|
|
182
|
+
|
|
183
|
+
## Outcome
|
|
184
|
+
|
|
185
|
+
Implemented and verified on 2026-03-12.
|
|
186
|
+
|
|
187
|
+
Fresh verification evidence:
|
|
188
|
+
|
|
189
|
+
```bash
|
|
190
|
+
node --test scripts/tests/*.test.cjs
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
Result: 44/44 tests passing.
|
|
@@ -0,0 +1,59 @@
|
|
|
1
|
+
# Enterprise Security Pack End-to-End Integration Plan
|
|
2
|
+
|
|
3
|
+
## Objective
|
|
4
|
+
|
|
5
|
+
Make Bug Hunter's bundled local security skills fully end-to-end connected, portable, and enterprise-grade.
|
|
6
|
+
|
|
7
|
+
The bundled local skills already exist under `skills/`, but the main Bug Hunter orchestration flow does not yet actively route into them. This plan closes that gap by wiring the main `SKILL.md`, documentation, tests, and evals so the companion skills become part of the product's core workflow rather than just packaged assets.
|
|
8
|
+
|
|
9
|
+
## Target outcomes
|
|
10
|
+
|
|
11
|
+
1. Main Bug Hunter flow explicitly routes into bundled local security skills when relevant.
|
|
12
|
+
2. Security entrypoints are easy to invoke and enterprise-friendly.
|
|
13
|
+
3. Docs, tests, and evals all reflect the integrated flow.
|
|
14
|
+
4. The repository remains fully portable with no external marketplace dependency.
|
|
15
|
+
5. After integration, run a focused Bug Hunter audit on the repository, fix any real bugs found, and summarize the net result.
|
|
16
|
+
|
|
17
|
+
## Integration model
|
|
18
|
+
|
|
19
|
+
Bug Hunter remains the top-level orchestrator.
|
|
20
|
+
|
|
21
|
+
Bundled local skills become capability modules:
|
|
22
|
+
- `skills/commit-security-scan/` → diff-scoped PR/commit/staged security review
|
|
23
|
+
- `skills/security-review/` → full security workflow (threat model + code + deps + validation)
|
|
24
|
+
- `skills/threat-model-generation/` → authoritative threat model bootstrap/refresh
|
|
25
|
+
- `skills/vulnerability-validation/` → exploitability/reachability/CVSS/PoC validation for security findings
|
|
26
|
+
|
|
27
|
+
The main skill should load these on demand from local paths and keep all artifacts under `.bug-hunter/`.
|
|
28
|
+
|
|
29
|
+
## Work plan
|
|
30
|
+
|
|
31
|
+
### Milestone 1 — Main skill routing
|
|
32
|
+
- Add security-oriented flags and aliases to `SKILL.md` / `README.md`
|
|
33
|
+
- Add explicit routing rules for when to read bundled local security skills
|
|
34
|
+
- Make threat model generation explicitly delegate to bundled `threat-model-generation`
|
|
35
|
+
- Make PR security review explicitly delegate to bundled `commit-security-scan`
|
|
36
|
+
- Make severe security validation explicitly delegate to bundled `vulnerability-validation`
|
|
37
|
+
- Make full security audit explicitly delegate to bundled `security-review`
|
|
38
|
+
|
|
39
|
+
### Milestone 2 — Enterprise UX surface
|
|
40
|
+
- Add enterprise-grade usage examples and a security-pack section in docs
|
|
41
|
+
- Keep behavior portable and artifact-native (`.bug-hunter/*` only)
|
|
42
|
+
|
|
43
|
+
### Milestone 3 — Guardrails
|
|
44
|
+
- Add regression tests proving the main skill references and exposes the bundled skills
|
|
45
|
+
- Add evals for the new end-to-end security flows
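
A regression check for the routing guardrail could be as small as the sketch below. It assumes the main `SKILL.md` refers to each bundled skill by its literal `skills/...` path; the helper name `missingSkillReferences` is hypothetical.

```javascript
// The four bundled skills named in the integration model.
const BUNDLED_SKILLS = [
  'skills/commit-security-scan/',
  'skills/security-review/',
  'skills/threat-model-generation/',
  'skills/vulnerability-validation/',
];

// Returns the bundled skill paths that the given SKILL.md text never mentions.
function missingSkillReferences(skillMarkdown) {
  return BUNDLED_SKILLS.filter((skillPath) => !skillMarkdown.includes(skillPath));
}
```

A test would read the main `SKILL.md` as a string and assert that the returned array is empty. A literal substring match is intentionally strict, so it fails if a skill is renamed without updating the routing text.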
|
|
46
|
+
|
|
47
|
+
### Milestone 4 — Cross verification and self-audit
|
|
48
|
+
- Run the full script test suite
|
|
49
|
+
- Run a focused Bug Hunter audit on the repository
|
|
50
|
+
- Fix any real bugs uncovered by that audit
|
|
51
|
+
- Summarize all shipped changes briefly
|
|
52
|
+
|
|
53
|
+
## Definition of done
|
|
54
|
+
|
|
55
|
+
- Main `SKILL.md` actively routes to the bundled local security skills
|
|
56
|
+
- `README.md` documents the integrated security pack as a real workflow, not just a packaged extra
|
|
57
|
+
- tests and evals cover the integrated paths
|
|
58
|
+
- full test suite passes
|
|
59
|
+
- self-audit completes and any confirmed bugs are fixed
|
|
@@ -0,0 +1,39 @@
|
|
|
1
|
+
# Local Security Skills Integration Plan
|
|
2
|
+
|
|
3
|
+
## Objective
|
|
4
|
+
|
|
5
|
+
Vendor the security-engineer marketplace capabilities into Bug Hunter as local, portable companion skills so the repository is self-contained and does not depend on external machine-specific skill paths.
|
|
6
|
+
|
|
7
|
+
Target local skills:
|
|
8
|
+
- `skills/commit-security-scan/`
|
|
9
|
+
- `skills/security-review/`
|
|
10
|
+
- `skills/threat-model-generation/`
|
|
11
|
+
- `skills/vulnerability-validation/`
|
|
12
|
+
|
|
13
|
+
## Design
|
|
14
|
+
|
|
15
|
+
Use Bug Hunter as the orchestrator and package the imported capabilities as local skills with Bug Hunter-native artifact paths and schemas.
|
|
16
|
+
|
|
17
|
+
Principles:
|
|
18
|
+
- No references to `.factory/` or external marketplace paths
|
|
19
|
+
- Reuse Bug Hunter-native artifacts under `.bug-hunter/`
|
|
20
|
+
- Keep skill bodies focused on capability/workflow; keep runtime logic in existing prompts/scripts
|
|
21
|
+
- Make the new skills portable by including them in the package `files` list and documenting them in the repo
|
|
22
|
+
|
|
23
|
+
## Work items
|
|
24
|
+
|
|
25
|
+
1. Create local skill directories with adapted `SKILL.md` files
|
|
26
|
+
2. Point all skill outputs/inputs to `.bug-hunter/*` artifacts and existing Bug Hunter concepts
|
|
27
|
+
3. Add a packaging/regression test to verify the local skills are present and packaged
|
|
28
|
+
4. Add `skills/` to `package.json` publish files
|
|
29
|
+
5. Document the bundled companion skills in `README.md`
|
|
30
|
+
6. Update `CHANGELOG.md`
|
|
31
|
+
7. Run tests
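
Work items 3 and 4 suggest a packaging check along these lines. This is a sketch under assumptions: `pkg` is the parsed `package.json`, `skillDocs` maps each vendored file path to its text, and the function name is hypothetical.

```javascript
// Hypothetical check: returns a list of human-readable packaging problems.
function findPackagingProblems(pkg, skillDocs) {
  const problems = [];
  const files = Array.isArray(pkg.files) ? pkg.files : [];
  const includesSkills = files.some(
    (entry) => entry === 'skills' || entry.startsWith('skills/')
  );
  if (!includesSkills) {
    problems.push('package.json "files" does not include skills/');
  }
  // Vendored skills must not point back at marketplace-specific paths.
  for (const [file, text] of Object.entries(skillDocs)) {
    if (text.includes('.factory/')) {
      problems.push(`${file} references .factory/`);
    }
  }
  return problems;
}
```

The check treats a missing `files` array as a problem because this plan calls for listing `skills/` explicitly, even though npm would publish most files by default when `files` is absent.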
|
|
32
|
+
|
|
33
|
+
## Definition of done
|
|
34
|
+
|
|
35
|
+
- `skills/` exists with the four local security skills
|
|
36
|
+
- no vendored skill references point to `.factory/` paths
|
|
37
|
+
- package metadata includes `skills/`
|
|
38
|
+
- tests verify the packaged skills exist
|
|
39
|
+
- docs explain the bundled local security pack
|