@vyuhlabs/dxkit 2.11.1 → 2.13.0

Files changed (73)
  1. package/CHANGELOG.md +67 -0
  2. package/README.md +192 -279
  3. package/dist/baseline/check.d.ts +7 -0
  4. package/dist/baseline/check.d.ts.map +1 -1
  5. package/dist/baseline/check.js +3 -1
  6. package/dist/baseline/check.js.map +1 -1
  7. package/dist/baseline/entry-to-located.d.ts +36 -11
  8. package/dist/baseline/entry-to-located.d.ts.map +1 -1
  9. package/dist/baseline/entry-to-located.js +69 -13
  10. package/dist/baseline/entry-to-located.js.map +1 -1
  11. package/dist/baseline/finding-identity.d.ts +32 -0
  12. package/dist/baseline/finding-identity.d.ts.map +1 -1
  13. package/dist/baseline/finding-identity.js +17 -2
  14. package/dist/baseline/finding-identity.js.map +1 -1
  15. package/dist/baseline/producers/index.d.ts.map +1 -1
  16. package/dist/baseline/producers/index.js +5 -1
  17. package/dist/baseline/producers/index.js.map +1 -1
  18. package/dist/baseline/producers/quality.d.ts +15 -2
  19. package/dist/baseline/producers/quality.d.ts.map +1 -1
  20. package/dist/baseline/producers/quality.js +20 -2
  21. package/dist/baseline/producers/quality.js.map +1 -1
  22. package/dist/baseline/producers/stale-allow.d.ts +13 -2
  23. package/dist/baseline/producers/stale-allow.d.ts.map +1 -1
  24. package/dist/baseline/producers/stale-allow.js +9 -2
  25. package/dist/baseline/producers/stale-allow.js.map +1 -1
  26. package/dist/baseline/types.d.ts +12 -0
  27. package/dist/baseline/types.d.ts.map +1 -1
  28. package/dist/cli.d.ts.map +1 -1
  29. package/dist/cli.js +81 -1
  30. package/dist/cli.js.map +1 -1
  31. package/dist/generator.d.ts.map +1 -1
  32. package/dist/generator.js +5 -0
  33. package/dist/generator.js.map +1 -1
  34. package/dist/loop/demo.d.ts +11 -0
  35. package/dist/loop/demo.d.ts.map +1 -0
  36. package/dist/loop/demo.js +156 -0
  37. package/dist/loop/demo.js.map +1 -0
  38. package/dist/loop/doctor.d.ts +37 -0
  39. package/dist/loop/doctor.d.ts.map +1 -0
  40. package/dist/loop/doctor.js +294 -0
  41. package/dist/loop/doctor.js.map +1 -0
  42. package/dist/loop/ledger-cli.d.ts +7 -0
  43. package/dist/loop/ledger-cli.d.ts.map +1 -0
  44. package/dist/loop/ledger-cli.js +95 -0
  45. package/dist/loop/ledger-cli.js.map +1 -0
  46. package/dist/loop/ledger.d.ts +95 -0
  47. package/dist/loop/ledger.d.ts.map +1 -0
  48. package/dist/loop/ledger.js +201 -0
  49. package/dist/loop/ledger.js.map +1 -0
  50. package/dist/loop/policy.d.ts +35 -0
  51. package/dist/loop/policy.d.ts.map +1 -0
  52. package/dist/loop/policy.js +151 -0
  53. package/dist/loop/policy.js.map +1 -0
  54. package/dist/loop/scaffold.d.ts +26 -0
  55. package/dist/loop/scaffold.d.ts.map +1 -0
  56. package/dist/loop/scaffold.js +221 -0
  57. package/dist/loop/scaffold.js.map +1 -0
  58. package/dist/loop/stop-gate.d.ts +71 -0
  59. package/dist/loop/stop-gate.d.ts.map +1 -0
  60. package/dist/loop/stop-gate.js +295 -0
  61. package/dist/loop/stop-gate.js.map +1 -0
  62. package/dist/types.d.ts +4 -0
  63. package/dist/types.d.ts.map +1 -1
  64. package/dist/update.d.ts.map +1 -1
  65. package/dist/update.js +9 -0
  66. package/dist/update.js.map +1 -1
  67. package/package.json +1 -1
  68. package/templates/.claude/skills/dxkit-config/SKILL.md +17 -0
  69. package/templates/.claude/skills/dxkit-init/SKILL.md +1 -0
  70. package/templates/.claude/skills/dxkit-learn/SKILL.md +17 -0
  71. package/templates/.claude/skills/dxkit-loop/SKILL.md +114 -0
  72. package/templates/.claude/skills/dxkit-onboard/SKILL.md +2 -0
  73. package/templates/.claude/skills/dxkit-update/SKILL.md +3 -0
package/CHANGELOG.md CHANGED
@@ -7,6 +7,73 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
  ## [Unreleased]
 
+ ## [2.13.0] - 2026-06-18
+
+ ### Loop pack — a deterministic Stop-gate for autonomous coding loops
+
+ When Claude Code runs in an autonomous loop (it keeps working until it
+ decides to stop), the new loop pack stops it from declaring "done" while it
+ has introduced net-new findings. It re-runs the guardrail on every Stop and
+ feeds any net-new findings back to the model for repair. The value is
+ predictability, not new detection — it bounds the "loop shipped debt and
+ never fixed it" failure mode using the findings, baseline, and identity
+ contract dxkit already computes.
+
+ - **`vyuh-dxkit init --claude-loop`** registers the Stop-gate hook. The
+ install is **additive**: it deep-merges the hook into an existing
+ `.claude/settings.json` (preserving your other hooks + permissions) and
+ appends a sentinel-delimited managed block to `CLAUDE.md` (never touching
+ your prose). Opt-in even under `--full`, because it registers a hook that
+ blocks the agent from stopping. Re-applied by `vyuh-dxkit update` on repos
+ that opted in.
+ - **Loop-scoped presets.** A `loop.preset` in `.dxkit/policy.json` decides
+ what blocks the loop: `security-only` (default — net-new secrets +
+ crit/high security + reachable dependency vulns) or `full-debt` (also
+ blocks test-gap + quality). It is read **only by the Stop-gate**; your CI
+ / PR guardrail always uses the full policy, so the loop posture can't
+ silently weaken your CI gate. `security-only` is the default because a
+ block in a loop tells the model to *fix* the finding, and open-ended debt
+ (write tests / refactor until clear) would make an unattended agent grind.
+ - **`vyuh-dxkit loop doctor`** — preflight that verifies a loop is wired
+ safely before an unattended run (baseline present, Stop hook registered,
+ guardrail runnable, posture). Catches the silent-failure class: an
+ unregistered hook never fires, so the loop would run with no gate and no
+ error. Exits non-zero so a CI loop-setup step can gate on it.
+ - **`vyuh-dxkit loop ledger [show|summarize|clear]`** — an append-only audit
+ trail of every Stop event (`.dxkit/loop/ledger.jsonl`): blocked vs
+ allowed, net-new counts, and repaired-after-block sessions.
+ - **New `dxkit-loop` skill** plus loop-aware updates to `dxkit-config`,
+ `dxkit-learn`, `dxkit-update`, `dxkit-onboard`, and `dxkit-init` so the
+ loop is set up and operated conversationally through Claude Code.
+ - No baseline re-creation is needed; existing baselines and allowlists are
+ unaffected. The loop pack is opt-in — existing installs are unchanged
+ until they run `init --claude-loop`.
+
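As a rough illustration of the loop-scoped presets listed in the 2.13.0 entry, the sketch below maps each preset to the finding kinds that would block a Stop. Only the preset names and what each one blocks come from the changelog; the `Finding` shape, its `kind`, `severity`, and `reachable` fields, and the `blocksLoop` helper are hypothetical and are not dxkit's actual API.

```typescript
// Hypothetical sketch: these types and names are illustrative, not dxkit's API.
type LoopPreset = "security-only" | "full-debt";

interface Finding {
  kind: "secret" | "security" | "dep-vuln" | "test-gap" | "quality";
  severity?: "critical" | "high" | "medium" | "low"; // consulted for security findings
  reachable?: boolean; // consulted for dependency vulnerabilities
}

// Decide whether a single net-new finding blocks the loop under a preset.
function blocksLoop(preset: LoopPreset, f: Finding): boolean {
  if (f.kind === "secret") return true;
  if (f.kind === "security") {
    return f.severity === "critical" || f.severity === "high";
  }
  if (f.kind === "dep-vuln") return f.reachable === true;
  // test-gap and quality findings only block under the opt-in full-debt preset.
  return preset === "full-debt";
}
```

Under this reading, `full-debt` is a strict superset of `security-only`, which matches the changelog's "also blocks test-gap + quality" wording.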
+ ## [2.12.0] - 2026-06-17
+
+ ### Guardrail: benign line shifts no longer read as net-new
+
+ - **Fixed a guardrail false-positive where inserting lines above a duplicate
+ block flagged every duplicate as net-new debt.** Duplicate-block findings
+ carry an identity derived from their exact start lines, but the matcher was
+ not relocating them across a change — so a comment added near the top of a
+ file shifted the blocks and the guardrail reported them as new. Duplicate
+ findings (and range-anchored coverage gaps) are now relocated like every
+ other line-anchored finding, so routine churn no longer trips the gate.
+ - **Closed the same gap for shallow clones and force-pushed baselines.**
+ Relocation across a change normally uses git history; where that history
+ isn't available, the matcher falls back to a content hash of the finding's
+ surroundings. Duplicate and orphaned-allowlist (`stale-allow`) findings now
+ carry that content hash too, matching the protection secret/code findings
+ already had — so the gate stays correct even on a depth-1 CI checkout.
+ - **Hardened the matcher against this whole class of bug.** A finding whose
+ identity moves with line position must be relocatable; new contract tests
+ derive that property automatically for every finding kind and fail if a kind
+ is ever added (or changed) that can drift without a way to relocate it.
+ - No baseline re-creation is needed — the finding identity is unchanged, so
+ existing baselines and allowlists keep matching. `vyuh-dxkit update` is a
+ no-op for this release beyond the version bump.
+
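The content-hash fallback described in the 2.12.0 entry can be sketched as follows. This is a minimal illustration under stated assumptions: a finding is keyed by rule and file and located by a 1-indexed start line, and the hash covers the finding's own line plus one line on either side. The `Located` type, `contextHash`, `relocate`, and the use of FNV-1a as a stand-in hash are all hypothetical; dxkit's real matcher, hash inputs, and git-history path may differ.

```typescript
// Hypothetical shape of a located finding; not dxkit's real type.
interface Located {
  rule: string;
  file: string;
  line: number; // 1-indexed start line
  hash: string; // content hash of the finding's surroundings
}

// FNV-1a: a simple non-cryptographic hash, used here only as a stand-in.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Hash the finding's line plus one line of context on each side.
function contextHash(lines: string[], line: number): string {
  return fnv1a(lines.slice(Math.max(0, line - 2), line + 1).join("\n"));
}

// Pair a baseline finding with a current one: prefer an unmoved match,
// then fall back to any same-rule, same-file finding with the same hash.
function relocate(base: Located, current: Located[]): Located | undefined {
  const candidates = current.filter(
    (c) => c.rule === base.rule && c.file === base.file
  );
  const unmoved = candidates.find(
    (c) => c.line === base.line && c.hash === base.hash
  );
  return unmoved ?? candidates.find((c) => c.hash === base.hash);
}
```

Because the hash depends only on nearby content, a comment inserted at the top of the file changes the finding's line but not its hash, so the baseline entry still pairs with the shifted finding even when no git history is available.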
  ## [2.11.1] - 2026-06-17
 
  ### Line-aware passive context hook
package/README.md CHANGED
@@ -1,143 +1,226 @@
  # dxkit
 
- **AI writes the code. dxkit helps ship it clean.**
+ **A deterministic Stop-gate for autonomous coding loops.**
 
- _Deterministic guardrails for any codebase. Brownfield-friendly by default._
+ Coding agents keep editing until they decide to stop. Tests and linters catch
+ broken code, but they do not know whether the agent made the repo worse than
+ the baseline. So loops can quietly ship new secrets, untested paths, and other
+ detector-backed regressions, then report success.
 
- dxkit scores your codebase deterministically, baselines today's findings, and gates every push against net-new regressions. It ships conversational skills that walk agents (and humans) through fixes. Existing tech debt stays grandfathered. Nothing runs on an LLM. Everything runs locally.
+ In our loop benchmark, vanilla Claude Code-style loops stopped with net-new
+ debt in **11 of 16 runs**. A prompt that told the agent to self-check still
+ escaped **9 of 16**. With dxkit's Stop-gate, we observed **0 of 16** escapes:
+ when the loop tried to stop dirty, dxkit blocked, handed back the exact net-new
+ finding, and the agent repaired before stopping clean.
 
  <p align="center">
- <img src=".github/assets/guardrail-demo.gif" width="760" alt="A git push blocked by the dxkit pre-push guardrail: 2 net-new regressions block the push while 644 pre-existing findings stay grandfathered." />
+ <img src=".github/assets/loop-stop-gate-demo.gif" width="820" alt="dxkit's Stop-gate blocks a coding-agent loop on a net-new critical dependency vulnerability, the agent bumps the version, and the gate goes clean." />
  </p>
 
+ dxkit does not reinvent detection. It runs trusted open source scanners
+ (gitleaks, Semgrep, OSV, npm audit, and more), and it can ingest results from
+ Snyk and CodeQL. What it adds is the piece those tools were not built for: a
+ deterministic check, on every stop, of whether this change introduced a new
+ finding compared with a baseline.
+
  ```bash
- npm init @vyuhlabs/dxkit
+ npx -y @vyuhlabs/dxkit demo loop-guardrail # see it in 5 seconds, no API key, no setup
  ```
 
+ Local. Offline. No model in the gate. Existing debt stays grandfathered. Only
+ net-new regressions block.
+
+ [Watch it block and repair](#watch-it-block-and-repair) · [Read the benchmark](docs/benchmarks.md) · [Try it on your repo](#try-it-locally)
+
  <p>
- <a href="https://www.npmjs.com/package/@vyuhlabs/dxkit">
- <img alt="npm version" src="https://img.shields.io/npm/v/@vyuhlabs/dxkit">
- </a>
- <img alt="license" src="https://img.shields.io/github/license/vyuh-labs/dxkit">
- <img alt="deterministic" src="https://img.shields.io/badge/scoring-deterministic-blue">
- <img alt="brownfield" src="https://img.shields.io/badge/brownfield-baseline%20guardrails-orange">
- <img alt="local-first" src="https://img.shields.io/badge/local-first-green">
+ <a href="https://www.npmjs.com/package/@vyuhlabs/dxkit"><img alt="npm" src="https://img.shields.io/npm/v/@vyuhlabs/dxkit"></a>
+ <img alt="license: MIT" src="https://img.shields.io/badge/license-MIT-green">
+ <img alt="deterministic gate" src="https://img.shields.io/badge/gate-deterministic-blue">
+ <img alt="local-first" src="https://img.shields.io/badge/local--first-success">
  </p>
 
  ---
 
- ## The problem
+ ## The problem: loops do not know when they made things worse
 
- Codebases drift downward in slow ways that tests do not catch.
+ An autonomous loop runs until the agent decides it is done. The only checks in
+ that loop today are tests and linters, and those catch broken code, not
+ regressed code. There is no notion of "worse than the baseline." So an agent
+ can add a feature, leave a new untested path or a hardcoded credential behind,
+ run the tests, see green, and declare success.
 
- A typical Friday. Your team ships a fix. CI passes. Review approves. Two weeks later, an auditor finds a new hardcoded secret in the diff, three new untested branches, and a previously-clean file that grew to 800 lines with three TODOs sprinkled in. None of it failed a test, because no test covered those things.
+ In our benchmark this happened in most vanilla runs, and telling the agent to
+ check its own work only helped a little.
 
- Now multiply this by every AI agent your team uses. Agents write more code than humans can review. Some of it is fine. Some of it is slop that looks fine but quietly degrades the codebase.
+ ## What dxkit does
 
- The conventional fix is "block any new finding via static analysis." That fails on real codebases for a predictable reason:
+ 1. **Baseline today's debt.** `baseline create` records every current finding,
+ so pre-existing issues are grandfathered and never block.
+ 2. **Run a deterministic Stop-gate on every stop.** A Claude Code Stop hook
+ re-runs the guardrail against that baseline. Same input gives the same
+ verdict, in seconds, offline, with no model in the loop.
+ 3. **Feed net-new findings back to the agent.** If the change introduced a
+ finding, the gate blocks the stop and hands the agent the exact finding to
+ fix: do not refresh the baseline, do not touch unrelated debt, fix what this
+ branch introduced. The loop stops only when clean.
 
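The three steps under "What dxkit does" reduce to a set difference over finding fingerprints. The sketch below is a minimal illustration, assuming every finding can be reduced to a stable fingerprint string; the `stopGate` function and its `GateResult` shape are hypothetical and are not dxkit's API.

```typescript
// Hypothetical sketch of a net-new check; names are illustrative only.
interface GateResult {
  verdict: "clean" | "blocked";
  netNew: string[]; // fingerprints present now but absent from the baseline
}

function stopGate(baseline: string[], current: string[]): GateResult {
  const known = new Set(baseline);
  // De-duplicate and sort so the same input always yields the same output.
  const netNew = Array.from(new Set(current))
    .filter((fp) => !known.has(fp))
    .sort();
  return { verdict: netNew.length === 0 ? "clean" : "blocked", netNew };
}
```

Findings that exist in the baseline never appear in `netNew`, which is what grandfathering amounts to in this model, and fixing an old finding (so it disappears from `current`) cannot cause a block.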
- - Block every finding, and your 5-year-old repo lights up with hundreds of pre-existing issues. The team disables the gate within a week.
- - Block no findings, and the gate is theater. Nothing changes.
+ ## Who this is for
 
- You need an objective gate that only fires on what is actually new. That is the gap dxkit fills.
+ Use dxkit if you let coding agents:
 
- ---
+ - run unattended or semi-attended,
+ - fix CI or review comments in loops,
+ - touch brownfield repos that already carry debt,
+ - or work where "new debt" matters more than "all debt."
 
- ## How dxkit solves it
+ ## Built on tools you already trust
 
- Three ideas working together.
+ dxkit is an orchestration and enforcement layer, not another scanner. It runs
+ established open source tools and treats their output as one stream:
 
- ### 1. Capture today's state as a baseline
+ - secrets: gitleaks
+ - code patterns: Semgrep
+ - dependency vulnerabilities: OSV and npm audit
+ - duplication, size, and the code graph: jscpd, cloc, and graphify
 
- Before dxkit blocks anything, it snapshots every existing finding in your repo and fingerprints them. The fingerprints survive renames, line shifts from formatter runs, and small unrelated edits. Cross-tool overlaps (gitleaks and semgrep flagging the same line) collapse to one finding.
+ For deep interprocedural analysis, it ingests findings from **Snyk Code** and
+ **CodeQL** (or any SARIF file), fingerprints them the same way as native
+ findings, and runs them through the same baseline and gate. You keep the
+ detectors you already have. dxkit makes their findings enforceable inside CI
+ and inside the agent loop.
 
- From this moment forward, the gate only fires on net-new regressions. Your existing debt is grandfathered. The team fixes old issues at their own pace. The gate stays useful because it stays reasonable.
+ | Layer | Examples | Job |
+ | --------- | ------------------------------------------------------ | ------------------------------------------------------- |
+ | Detection | gitleaks, Semgrep, OSV, npm audit, Snyk, CodeQL, SARIF | Find issues |
+ | dxkit | baseline, fingerprint matcher, Stop-gate, loop ledger | Decide whether this change introduced something net-new |
+ | Agent | Claude Code or another coding loop | Repair the exact finding and try to stop again |
 
- Three modes for the baseline file:
+ ## Watch it block and repair
 
- - `committed-full`: rich entries committed to git. Default for private repos.
- - `committed-sanitized`: stripped to fingerprint plus kind. For compliance-conscious teams.
- - `ref-based`: no committed file at all. Prior side recomputed from a git ref via `git worktree add`. Default for public repos. Zero disclosure surface.
+ ```text
+ checkout-service · loop behind the dxkit Stop-gate
+ task: add a debounce helper using lodash 4.17.4
+ claude ▸ Added a debounce helper using lodash 4.17.4. Done.
+ ✗ dxkit Stop-gate ▸ BLOCKED: 1 net-new finding
+ lodash 4.17.4: critical dependency vuln (GHSA-JF85-CPCP-J695)
+ claude ▸ Bumped lodash to 4.17.21 and re-checked. Done.
+ ✓ dxkit Stop-gate ▸ CLEAN the loop may stop.
+ ```
 
- ### 2. Score the codebase deterministically
+ Recorded from a real run on a synthetic repo, shortened for readability.
+ Blocked and repaired inside the same warm loop.
 
- dxkit produces a 0 to 100 score across six dimensions: Security, Code Quality, Tests, Documentation, Maintainability, Developer Experience.
+ ## Try it locally
 
- The score has four properties:
+ See the gate with no API key, no Claude Code, and no setup:
 
- | Property | What it means |
- | --------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
- | **Deterministic** | Same code yields the same score every time. No LLM in the grading path. Reproducible across machines, runs, and CI. Auditable. |
- | **Comparable** | Two codebases of similar quality produce similar scores. Surface tricks do not move the needle. Adding empty comments does not improve Documentation if the code is not actually documented. |
- | **Severity-weighted** | A critical security finding moves the score far more than a TODO comment. Penalties are anchored to real-world impact via CVSS for security and ratio thresholds for tests, coverage, file size, and other dimensions. |
- | **Actionable** | Every deduction names the file, the line, and the recommended fix. Output is structured JSON. Agents and humans read the same thing. The "what to do next" lives in the score itself. |
+ ```bash
+ npx -y @vyuhlabs/dxkit demo loop-guardrail
+ ```
 
- ### 3. Fix findings at reduced token cost
+ It runs the real gate over an example finding and shows what it feeds the
+ agent: block, repair, clean.
 
- Detection is only half the job. dxkit builds a deterministic code graph of the repo (its symbols, call edges, and clustered modules), so fixing is cheap too. A coding agent works from that structure ("what calls this? what breaks if I change it?") instead of re-reading whole files, and every finding in a detailed report already carries its blast radius: the files that depend on it. The `dxkit-action` skill runs the fix, re-scores, and confirms the gate clears. Same result, far fewer tokens.
+ Wire it into your real Claude Code loop:
 
- ### What you get from the combination
+ ```bash
+ npx @vyuhlabs/dxkit init --claude-loop # registers the Stop hook (additive: your settings are kept)
+ npx @vyuhlabs/dxkit baseline create # grandfather today's findings
+ npx @vyuhlabs/dxkit loop doctor # verify the gate is wired safely
+ # then run Claude Code as you normally would. The Stop-gate fires on every stop.
+ npx @vyuhlabs/dxkit loop ledger summarize # afterwards: blocked vs allowed, repaired-after-block
+ ```
 
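To make the "blocked vs allowed, repaired-after-block" summary concrete, here is a small sketch of how such totals could be computed from a JSONL audit trail. The record shape (`session`, `verdict`, `netNew`) is an assumption made for illustration; the actual schema of `.dxkit/loop/ledger.jsonl` is not shown in this README, and `summarize` is a hypothetical helper, not the CLI's implementation.

```typescript
// Hypothetical ledger record; the real ledger.jsonl schema may differ.
interface StopEvent {
  session: string;
  verdict: "blocked" | "allowed";
  netNew: number;
}

interface Summary {
  blocked: number;
  allowed: number;
  repairedAfterBlock: number; // sessions blocked at least once, later allowed
}

function summarize(jsonl: string): Summary {
  const events: StopEvent[] = jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as StopEvent);

  const summary: Summary = { blocked: 0, allowed: 0, repairedAfterBlock: 0 };
  const blockedSessions = new Set<string>();
  const repairedSessions = new Set<string>();

  // Events are assumed to be in append (chronological) order.
  for (const e of events) {
    if (e.verdict === "blocked") {
      summary.blocked++;
      blockedSessions.add(e.session);
    } else {
      summary.allowed++;
      if (blockedSessions.has(e.session)) repairedSessions.add(e.session);
    }
  }
  summary.repairedAfterBlock = repairedSessions.size;
  return summary;
}
```

Counting repaired sessions rather than repaired events keeps a session that was blocked several times before stopping clean from being counted more than once.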
- A score on its own is a number. A baseline on its own grandfathers the past. Together they produce an objective stop signal you can trust.
+ ### Presets: what blocks the loop
 
  ```text
- Today: 16/100 E 644 findings, all baselined
- Next PR: 16/100 E 644 persisted, 0 new. Gate passes.
- Bad PR: 14/100 E 644 persisted, 2 new high-severity. Gate blocks.
+ security-only (default) secrets and critical or high vulnerabilities. Bounded, must-fix, cheap to gate.
+ full-debt (opt-in) also gates test gaps and maintainability regressions. Repairs can be expensive.
  ```
 
- The score does not lie. The baseline keeps it useful on real codebases. The combination works the same for humans, AI agents, and CI runners. That is the part that scales. And once the gate fires, the code graph makes acting on it cheap: agents fix from the structure rather than reading file after file.
+ The default is `security-only`. The headline escape-rate benchmark used
+ `full-debt` (it gated both the secret trap and the test-gap trap); the default
+ install starts narrower so a first run does not trap users in expensive
+ test-generation loops. Switch with `init --claude-loop --loop-preset full-debt`.
 
- ---
+ ## Graph context: reducing the exploration tail
 
- ## 60-second demo
+ The Stop-gate controls what the loop is allowed to ship. The code graph helps
+ control how far the loop wanders. When dxkit scaffolds a repo, it builds a code
+ graph and feeds the agent structural context: callers, callees, and blast
+ radius. The agent gets a map before it starts grepping through unfamiliar code.
 
- ```text
- $ npm init @vyuhlabs/dxkit
- Created: 14 files
- Git hooks: installed 1 file(s)
- .githooks/pre-push
- ✓ Devcontainer: installed 3 file(s)
- ✓ CI guardrails workflow: installed 1 file(s)
- .github/workflows/dxkit-guardrails.yml
- ✓ Done! Claude Code now has full project context.
- → Next: run `vyuh-dxkit baseline create` to capture today's state.
-
- $ npx vyuh-dxkit baseline create
- → Baseline mode=committed-full (auto: visibility not detectable via gh; defaulting to private posture)
- ✓ Wrote .dxkit/baselines/main.json — 644 findings, salt: deterministic (208.9s)
-
- $ npx vyuh-dxkit guardrail check
- ## Guardrail: PASSED
- No changes from baseline (644 pairs checked).
- ```
+ The honest result from our benchmarks is predictable spend, not guaranteed
+ cheaper spend. On a large repo the median was roughly tied, but the worst-case
+ session used about **57% fewer tokens** and the variance was **roughly halved**.
+ On a small repo the overhead was about zero. The graph caps the expensive tail.
+ It does not promise a lower average.
 
- Later, an innocent-looking PR slips in a regression. The pre-push hook fires:
+ This is a different axis from detection. Snyk, SonarQube, and CodeQL tell you
+ what is wrong. They do not give the agent a map of the code or bound how much it
+ spends finding its way around. dxkit does both: the gate bounds what the loop
+ ships, the graph bounds what the loop costs.
 
- ```text
- $ git push
- [hook] vyuh-dxkit guardrail check
- ## Guardrail: BLOCKED
- 2 new regressions found.
-
- | Status | Kind | Severity | Location | Reason |
- |---|---|---|---|---|
- | added | secret | high | src/config/secrets.ts:42 | gitleaks/aws-access-key |
- | added | code | medium | src/handlers/exec.ts:17 | semgrep/eval-use |
-
- 644 pre-existing findings persisted. Only the new changes blocked you.
- Fix or allowlist with `npx vyuh-dxkit allowlist add ...`
- ```
+ ## The numbers
 
- The 644 pre-existing findings sit quietly. The 2 net-new ones stop the push.
+ Three independent benchmark results, one theme: dxkit makes agent work more
+ predictable.
 
- ---
+ | Layer | What it bounds | Observed result |
+ | -------------------------- | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------- |
+ | **Stop-gate** | unsafe final state | vanilla loops escaped **11/16** times, prompt-only checklist escaped **9/16**, dxkit escaped **0/16** |
+ | **Deterministic identity** | false "net-new" findings under churn | **100% catch / 0% false-block** on seeded gate tests; **0 false net-new** on tested line shifts and renames |
+ | **Graph context** | large-repo exploration tails | median roughly tied, but large-repo mean tokens **30% lower**, worst case **57% lower**, variance roughly halved |
+
+ > **Benchmark caveats:** the loop-safety study uses controlled synthetic tasks
+ > plus real-repo validation, detector-backed findings, and Sonnet runs. It is
+ > not a CVE corpus, not a claim of better detection, and not a guarantee that
+ > dxkit catches every possible bug. The claim is narrower: for findings the
+ > detector observes, dxkit gives the loop a deterministic net-new stop decision.
+
+ Full methodology, raw artifacts, and the rest of the caveats are in
+ **[docs/benchmarks.md](docs/benchmarks.md)**.
+
+ ## What dxkit is, and is not
+
+ **It is a deterministic verification layer.** It baselines today's findings,
+ fingerprints them across churn, and blocks only net-new regressions.
+
+ **It is not a scanner replacement.** It runs and ingests scanners (gitleaks,
+ Semgrep, CodeQL, Snyk, SARIF) and makes their findings enforceable. It does not
+ claim to find more bugs than they do.
+
+ **It is not an LLM judge.** No model decides whether the gate passes. The model
+ can repair findings. The gate itself is deterministic, and the prompt does not
+ grow as the baseline grows.
+
+ **It is not a guarantee of safe code.** It blocks detector-backed net-new
+ findings it can observe. You still need tests, review, scanners, and judgment.
+
+ ## Why not just Snyk, SonarQube, or CodeQL?
+
+ Use them. dxkit can ingest their findings. The difference is tempo and control,
+ not detection. Cloud scanners are strong detection engines, and they usually
+ run on a CI or PR cadence. A coding-agent loop needs a local stop decision
+ every time the agent tries to declare done.
 
- ## Features
+ | Loop Stop-gate need | dxkit | Cloud or CI scanners |
+ | ----------------------------------------------------------- | ----- | -------------------------------------- |
+ | Runs locally on every stop, in seconds | yes | usually CI or cloud cadence |
+ | Can run without network or auth | yes | usually requires network or auth |
+ | Grandfathers existing debt | yes | tool-dependent |
+ | Feeds the exact block reason back to the warm agent session | yes | usually a human-facing dashboard or PR |
 
- ### Eight first-class language packs
+ The goal is not to replace scanners. It is to make their findings enforceable
+ at the speed of the agent loop.
 
- TypeScript / JavaScript, Python, Go, Rust, C# / .NET, Java, Kotlin, Ruby. Each pack ships per-ecosystem analyzers: semgrep rulesets, dep-vuln scanners, license tools, lint adapters. Polyglot repos get unified reports without configuration.
+ ## Beyond loops
+
+ The same deterministic core powers the rest of dxkit: pre-push and CI
+ guardrails, brownfield baselines, durable finding identity, SARIF, CodeQL, and
+ Snyk ingest, a six-dimension health report, code-graph context, and a set of
+ Claude Code skills. It covers TypeScript / JavaScript, Python, Go, Rust, C# /
+ .NET, Java, Kotlin, and Ruby. See **[the docs](docs/README.md)**.
 
  <details>
  <summary><strong>Per-pack capabilities</strong> (click to expand)</summary>
@@ -165,203 +248,33 @@ so it does not inflate the Code Quality score.
165
248
 
166
249
  </details>
167
250
 
168
- ### The matcher
169
-
170
- Multi-axis fingerprints (location, domain, content, semantic) pair findings across runs even when files were renamed, lines shifted, tools changed versions, or the branch was force-pushed. When location fails, the matcher falls back to git-aware diff lookup, then content hash, then identity-only multiset match. Every pair carries a confidence score and a reason chain.
171
-
172
- ### Per-finding suppression
173
-
174
- Five typed categories: `false-positive`, `test-fixture`, `mitigated-externally`, `accepted-risk`, `deferred`. Each entry requires a reason. Categories that fade over time require an expiry.
175
-
176
- Two surfaces:
177
-
178
- - Inline annotations: `// dxkit-allow:test-fixture reason="example placeholder"`
179
- - File-level: `.dxkit/allowlist.json`, audited via `vyuh-dxkit allowlist audit`
180
-
181
- Orphaned annotations become their own findings. The TypeScript `@ts-expect-error` model applied to suppressions. Prevents the graveyard of stale allowlist entries.
182
-
183
- ### AI-agent integration
184
-
185
- dxkit ships a suite of Claude Code skills under `.claude/skills/dxkit-*`. They wrap the CLI in conversational flows:
186
-
187
- | Skill | What it does |
188
- | --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
189
- | `dxkit-onboard` | Walks a customer through the full first-install journey |
190
- | `dxkit-reports` | Runs analyzers and explains the output |
191
- | `dxkit-action` | Reads a report, prioritizes findings, plans and runs fixes, re-verifies |
192
- | `dxkit-ingest` | Brings external SAST findings (Snyk Code, CodeQL, SARIF) into dxkit |
193
- | `dxkit-fix` | Repairs a broken install from doctor output |
194
- | `dxkit-allowlist` | Manages the suppression lifecycle: audit, remove, prune, export to Snyk |
195
- | `dxkit-test` | Writes the missing tests to close gaps + raise the Tests score |
196
- | `dxkit-pr` | Opens a PR with a diff-grounded body + dxkit signals + reviewer checklist |
197
- | `dxkit-feature`, `dxkit-docs`, `dxkit-hooks`, `dxkit-config`, `dxkit-learn`, `dxkit-update`, `dxkit-init` | Focused flows |
198
-
199
- `AGENTS.md` (the open standard read by Codex, Cursor, Aider, and others) also ships in every install. The skill flows are Claude Code-specific today; the AGENTS.md context is portable.
200
-
201
- Why this matters for AI workflows: when an agent fixes a bug, you need an objective signal that says "yes, fixed cleanly" or "fix introduced four new regressions." dxkit's deterministic score plus baseline guardrail produces that signal. The agent reads the same JSON envelope a human reads, runs the verify step itself, and stops when clean.
202
-
- ### Code-graph context: fix at reduced token cost
-
- dxkit builds a deterministic code graph of your repo (its symbols, call edges, and clustered modules) using graphify (the `graphifyy` Python package). What matters is what an agent does with it. Instead of discovering structure by grepping around and reading whole files, the agent gets just the relevant slice:
-
- - **`vyuh-dxkit context <query>`** (and an opt-in PreToolUse hook) hand an agent a slim structural map: the relevant symbols, where they live, and what calls them. It navigates by the graph instead of re-reading files, which is the same work at a fraction of the tokens.
- - **`--graph-context`** writes each finding's module and blast radius (which files call into it) straight into the detailed report, so the `dxkit-action` fix skill can plan the change, and know which callers to re-test, without rediscovering structure first.
- - **`vyuh-dxkit explore`** and a dashboard graph tab let humans ask the same graph what the repo does, where a feature lives, and which files are load-bearing.
-
- This is an additive, fail-open layer. When the graph is missing, or a language's call edges can't be resolved, every command behaves exactly as it did before. It's reliable on TypeScript, Python, and Go. Where the call graph can't be resolved (C#), blast radius is suppressed rather than faked, so a "no callers" reading is never mistaken for "safe to change."
-
- ### Connect findings and PRs to the people who know the code
-
- A finding or a PR is more actionable when you know who to ask. dxkit grounds that in an **active-owner model** — recency-weighted git history, scoped to who is still active, with bots and departed contributors filtered, the change author excluded, and a bus-factor signal.
-
- - **`vyuh-dxkit reviewers`** suggests reviewers for a change, ranked by active ownership of the touched files and blended with `CODEOWNERS` — a better signal than a platform's naive last-touch suggestion. The `dxkit-pr` skill folds it into the PR body.
- - **`--attribute`** adds a "who to ask" column to a detailed report: a pre-existing finding is traced to its current owner (an inactive author is routed to whoever owns the file now). It's opt-in and historical — a net-new finding is introduced by your own change.
-
- Output is names + GitHub @handles, never raw emails — the @handle is both privacy-safe and @-mentionable.
-
- ### Deep SAST: interprocedural findings from any engine
-
- dxkit's bundled SAST (community semgrep) is intraprocedural — it can't follow tainted data across function boundaries, so it misses the path-traversal / information-exposure / SSRF / injection class that an interprocedural engine like Snyk Code or CodeQL catches. dxkit doesn't try to re-detect that class; it **ingests** it and makes it first-class.
-
- - **`vyuh-dxkit ingest --from-snyk`** brings in your Snyk Code findings and works on every Snyk plan: it reads the REST API quota-free where you have it (Enterprise), and on Free/Team plans automatically falls back to `snyk code test` (one test per run). **`--sarif <file>`** ingests SARIF from any engine; **`--codeql`** runs CodeQL on demand (open-source / GitHub Advanced Security).
- - Ingested findings enter the same pipeline as native ones: fingerprinted and deduped, written to the baseline, enforced by the guardrail, and graph-linked under `--graph-context` so the `dxkit-action` fix loop sees blast radius + callers — context the source engine's own autofix doesn't have.
- - The findings live in a committed `.dxkit/external/` snapshot, so the engine token is needed only at ingest time (ideally one on-demand CI job) — every developer and CI run reads the snapshot without it.
-
- dxkit isn't competing with the detection engine — it's the governance + agentic-fix layer on top of whichever one you can run. The `dxkit-ingest` skill walks through setup and picks the engine license-aware (your own Snyk for private repos; CodeQL for open source / GHAS).
+ ## Reproduce the benchmark
 
- ### Reproducible environments
-
- Per-stack devcontainer with only the languages your project uses. Scanner toolchain auto-installed. Install scripts for AI agent CLIs (auth stays user-owned). Codespaces prebuilds wire via `vyuh-dxkit setup-prebuild` so cold-start drops from ~7 minutes to ~30 seconds.
-
- ### Public-repo safe baselines
-
- The `ref-based` mode commits no baseline file. The guardrail check recomputes the prior side at check time from a git ref via `git worktree add`. Zero disclosure surface. File paths, package names, and advisory IDs all stay out of git. Auto-picked for public repos via `gh repo view --json visibility`.
-
- ---
-
- ## Quickstart
-
- ```bash
- # Canonical first install
- npm init @vyuhlabs/dxkit
-
- # Capture today's state
- npx vyuh-dxkit baseline create
-
- # Verify the install
- npx vyuh-dxkit doctor
-
- # Commit and ship
- git add . && git commit -m "chore: enable dxkit" && git push
-
- # Optional but recommended
- npx vyuh-dxkit setup-branch-protection # mark guardrail as required CI check
- npx vyuh-dxkit setup-prebuild # Codespaces prebuild
- ```
-
- À la carte if you only want specific pieces:
+ The deterministic tier runs offline, so you do not have to trust our numbers:
 
  ```bash
- npx vyuh-dxkit init --with-dxkit-agents # just the dxkit-* Claude skills + AGENTS.md
- npx vyuh-dxkit init --with-hooks # just the pre-push hook
- npx vyuh-dxkit init --with-precommit-hook # add pre-commit (slow on large repos)
- npx vyuh-dxkit init --with-devcontainer # just the per-stack devcontainer
- npx vyuh-dxkit init --with-ci # just the PR-gate workflow
+ npx @vyuhlabs/dxkit demo loop-guardrail # the gate, end to end, no API key
+ npx @vyuhlabs/dxkit init --claude-loop
+ npx @vyuhlabs/dxkit baseline create
+ npx @vyuhlabs/dxkit loop doctor
  ```
 
- ---
-
- ## What dxkit analyzes
-
- | Dimension | Tools | What it catches |
- | -------------------- | --------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------- |
- | Security | gitleaks, semgrep, osv-scanner, npm-audit, pip-audit, govulncheck, cargo-audit, dotnet vulnerable, bundle-audit | Secrets, dep vulnerabilities, insecure patterns, TLS bypass |
- | Code Quality | cloc, jscpd, graphify, lint adapters | File size, duplication, complexity, hygiene markers |
- | Tests | coverage adapters per pack, test-file detector | Missing tests, degraded tests, coverage gaps |
- | Documentation | doc-comment ratio, README presence | Inline doc coverage, project-level docs |
- | Maintainability | graphify call-graph metrics | God files, dead imports, cohesion, communities |
- | Developer Experience | git hook detection, CI workflow detection, manifest presence | Pre-push hooks, CI quality gates, environment reproducibility |
-
- Each analyzer reports raw findings. dxkit aggregates, deduplicates across tools, and scores deterministically.
-
- ---
-
- ## Brownfield vs greenfield
-
- | | Greenfield (day 1) | Brownfield (years of debt) |
- | ---------------- | -------------------------------------- | ------------------------------------------------- |
- | Baseline | Near-zero on capture | Captures today's debt as floor |
- | Behavior | Every regression matters from commit 1 | Existing debt grandfathered; net-new blocks |
- | Cleanup pressure | Stay clean, easily | Improve incrementally; no required cleanup sprint |
-
- The status taxonomy that drives gate decisions:
-
- | Status | Meaning | Default |
- | ------------------- | ----------------------------------------- | ---------- |
- | `added` | Net-new finding introduced by this change | **blocks** |
- | `relocated` | Same finding, moved (line drift, rename) | passes |
- | `persisted` | Same finding, same place. Pre-existing. | passes |
- | `removed` / `fixed` | Was there, now gone | passes |
- | `tooling_drift` | New because scanner version changed | warns |
- | `config_drift` | New because dxkit config changed | warns |
- | `uncertain` | Below confidence threshold | warns |
+ Methodology and raw artifacts: **[docs/benchmarks.md](docs/benchmarks.md)**.
 
- Customize via [`.dxkit/policy.json`](docs/configuration/policy.md).
+ ## Credits
 
- ---
-
- ## Safety and trust
-
- - **Local-first.** Every scan runs on the developer's machine. Nothing leaves the repo. No telemetry. No phone-home.
- - **No LLM in the grading path.** Scores come from deterministic analyzers and arithmetic. Reproducible. Auditable. The only way to improve a score is to write better code.
- - **Sigstore provenance.** Every npm release is signed via OIDC from GitHub Actions. Verify with `npm audit signatures`.
- - **Open source.** MIT licensed. Inspect every score derivation.
-
- ---
-
- ## Real-world validation
-
- dxkit ships against pinned production codebases across all eight language packs. Every release runs a cross-stack walkthrough on a polyglot reference repo (TypeScript + Python) and a .NET reference repo before tagging. The cross-stack regression suite is part of CI.
-
- Recent ship validation (`@vyuhlabs/dxkit@2.6.0`, 2026-05-23):
-
- - 1904 tests across 110 files
- - License findings dropped 73% on a 600-source-file polyglot codebase after the 2.6 baseline polish
- - New `ref-based` mode verified end-to-end on both reference stacks
-
- ---
-
- ## Documentation
-
- **Start here**:
-
- - [Getting started](docs/getting-started.md): full walkthrough from install to first guardrail check
- - [CHANGELOG](CHANGELOG.md): release notes. Latest is [2.6.0](https://github.com/vyuh-labs/dxkit/releases/tag/v2.6.0)
-
- **Depth**:
-
- - [Why dxkit](docs/why-dxkit.md): rationale, comparison vs SonarQube/Snyk/Semgrep/etc., open methodology
- - [Architecture](docs/ARCHITECTURE.md): data flow, the git-aware matcher, fingerprint axes
- - [Scoring methodology](docs/SCORING.md): how each dimension is computed, citations
- - [Roadmap](docs/roadmap.md): shipped vs planned
-
- **Reference**:
-
- - [Command reference](docs/README.md): every subcommand at a glance
- - [`baseline`](docs/commands/baseline.md): capture, show, modes
- - [`guardrail`](docs/commands/guardrail.md): check, classify, render
- - [`allowlist`](docs/commands/allowlist.md): per-finding suppression
- - [`.dxkit/policy.json`](docs/configuration/policy.md): tune what blocks vs warns
- - [Reporting issues](docs/commands/issue.md): `vyuh-dxkit issue --type=...`
-
- ---
-
- ## Contributing
-
- See [CONTRIBUTING.md](CONTRIBUTING.md). The project follows architectural rules in [CLAUDE.md](CLAUDE.md). Adding a new language pack, a new finding kind, or a new scoring dimension each have one-page recipes.
-
- ---
+ dxkit stands on excellent open source tools. It orchestrates them, it does not
+ replace them. Thank you to the maintainers of
+ [graphify](https://github.com/safishamsi/graphify) (the code graph),
+ [gitleaks](https://github.com/gitleaks/gitleaks),
+ [Semgrep](https://github.com/semgrep/semgrep),
+ [OSV-Scanner](https://github.com/google/osv-scanner),
+ [jscpd](https://github.com/kucherenko/jscpd), and
+ [cloc](https://github.com/AlDanial/cloc). Each tool is installed separately and
+ keeps its own license.
 
- ## License
+ ## Contributing and roadmap
 
- MIT. See [LICENSE](LICENSE).
+ - Contributing guide: [CONTRIBUTING.md](CONTRIBUTING.md)
+ - Roadmap: [docs/roadmap.md](docs/roadmap.md)
+ - License: MIT
@@ -65,6 +65,13 @@ export interface RunGuardrailCheckOptions {
  * `<cwd>/.dxkit/policy.json` is auto-loaded if it exists; otherwise
  * the compiled-in defaults apply. */
  readonly policyPath?: string;
+ /** Pre-resolved policy override. When supplied, the orchestrator uses
+ * it verbatim and skips disk resolution (`policyPath` /
+ * `.dxkit/policy.json`). This is the seam the loop Stop-gate uses to
+ * inject its loop-scoped preset policy (see
+ * `src/loop/policy.ts:resolveLoopPolicy`) WITHOUT changing what the
+ * CI guardrail resolves. CI / `baseline check` never set this. */
+ readonly policy?: BrownfieldPolicy;
  /** Forwarded to the underlying analyzers for per-tool timing logs. */
  readonly verbose?: boolean;
  /** Pre-resolved baseline mode. When supplied, the orchestrator