@harness-engineering/cli 1.6.0 → 1.6.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/agents/personas/code-reviewer.yaml +2 -0
- package/dist/agents/personas/codebase-health-analyst.yaml +5 -0
- package/dist/agents/personas/performance-guardian.yaml +26 -0
- package/dist/agents/personas/security-reviewer.yaml +35 -0
- package/dist/agents/skills/claude-code/harness-autopilot/SKILL.md +494 -0
- package/dist/agents/skills/claude-code/harness-autopilot/skill.yaml +52 -0
- package/dist/agents/skills/claude-code/harness-code-review/SKILL.md +15 -0
- package/dist/agents/skills/claude-code/harness-integrity/SKILL.md +20 -6
- package/dist/agents/skills/claude-code/harness-perf/SKILL.md +231 -0
- package/dist/agents/skills/claude-code/harness-perf/skill.yaml +47 -0
- package/dist/agents/skills/claude-code/harness-perf-tdd/SKILL.md +236 -0
- package/dist/agents/skills/claude-code/harness-perf-tdd/skill.yaml +47 -0
- package/dist/agents/skills/claude-code/harness-pre-commit-review/SKILL.md +27 -2
- package/dist/agents/skills/claude-code/harness-release-readiness/SKILL.md +657 -0
- package/dist/agents/skills/claude-code/harness-release-readiness/skill.yaml +57 -0
- package/dist/agents/skills/claude-code/harness-security-review/SKILL.md +206 -0
- package/dist/agents/skills/claude-code/harness-security-review/skill.yaml +50 -0
- package/dist/agents/skills/claude-code/harness-security-scan/SKILL.md +102 -0
- package/dist/agents/skills/claude-code/harness-security-scan/skill.yaml +41 -0
- package/dist/agents/skills/claude-code/harness-state-management/SKILL.md +22 -8
- package/dist/agents/skills/gemini-cli/harness-autopilot/SKILL.md +494 -0
- package/dist/agents/skills/gemini-cli/harness-autopilot/skill.yaml +52 -0
- package/dist/agents/skills/gemini-cli/harness-perf/SKILL.md +231 -0
- package/dist/agents/skills/gemini-cli/harness-perf/skill.yaml +47 -0
- package/dist/agents/skills/gemini-cli/harness-perf-tdd/SKILL.md +236 -0
- package/dist/agents/skills/gemini-cli/harness-perf-tdd/skill.yaml +47 -0
- package/dist/agents/skills/gemini-cli/harness-release-readiness/SKILL.md +657 -0
- package/dist/agents/skills/gemini-cli/harness-release-readiness/skill.yaml +57 -0
- package/dist/agents/skills/gemini-cli/harness-security-review/skill.yaml +50 -0
- package/dist/agents/skills/gemini-cli/harness-security-scan/SKILL.md +102 -0
- package/dist/agents/skills/gemini-cli/harness-security-scan/skill.yaml +41 -0
- package/dist/bin/harness.js +1 -1
- package/dist/{chunk-VS4OTOKZ.js → chunk-O6NEKDYP.js} +789 -299
- package/dist/index.js +1 -1
- package/package.json +2 -2
@@ -32,6 +32,15 @@ Invoke `harness-verify` to run the mechanical quick gate.
 3. **If ALL three checks FAIL**, stop here. Do not proceed to Phase 2. The code is not in a reviewable state.
 4. If at least one check passes (or some are skipped), proceed to Phase 2.
 
+### Phase 1.5: SECURITY SCAN
+
+Run the built-in security scanner as a mechanical check between verification and AI review.
+
+1. Use `run_security_scan` MCP tool against the project root (or changed files if available).
+2. Capture findings by severity: errors, warnings, info.
+3. **Error-severity security findings are blocking** — they cause the overall integrity check to FAIL, same as a test failure.
+4. Warning/info findings are included in the report but do not block.
+
 ### Phase 2: REVIEW
 
 Run change-type-aware AI review using `harness-code-review`.
@@ -40,6 +49,7 @@ Run change-type-aware AI review using `harness-code-review`.
 2. Invoke `harness-code-review` with the detected change type.
 3. Capture the review findings: suggestions, blocking issues, and notes.
 4. A review finding is "blocking" only if it would cause a runtime error, data loss, or security vulnerability.
+5. The AI review includes a security-focused pass that complements the mechanical scanner — checking for semantic issues like user input flowing to dangerous sinks across function boundaries.
 
 ### Phase 3: REPORT
 
@@ -47,10 +57,11 @@ Produce a unified integrity report in this exact format:
 
 ```
 Integrity Check: [PASS/FAIL]
-- Tests:
-- Lint:
-- Types:
--
+- Tests: [PASS/FAIL/SKIPPED]
+- Lint: [PASS/FAIL/SKIPPED]
+- Types: [PASS/FAIL/SKIPPED]
+- Security: [PASS/WARN/FAIL] ([count] errors, [count] warnings)
+- Review: [PASS/FAIL] ([count] suggestions, [count] blocking)
 
 Overall: [PASS/FAIL]
 ```
@@ -90,19 +101,22 @@ Integrity Check: PASS
 - Tests: PASS (42/42)
 - Lint: PASS (0 warnings)
 - Types: PASS
+- Security: PASS (0 errors, 0 warnings)
 - Review: 1 suggestion (0 blocking)
 ```
 
-### Example: Blocking Issue
+### Example: Security Blocking Issue
 
 ```
 Integrity Check: FAIL
 - Tests: PASS (42/42)
 - Lint: PASS
 - Types: PASS
+- Security: FAIL (1 error, 0 warnings)
+  - [SEC-INJ-002] src/auth/login.ts:42 — SQL query built with string concatenation
 - Review: 3 findings (1 blocking)
 
-Blocking: [
+Blocking: [SEC-INJ-002] SQL injection — user input passed directly to query without parameterization.
 ```
 
 ## Gates
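The Phase 1.5 severity rules in the harness-integrity diff above (error findings fail the check, warnings are reported but do not block) can be sketched as a small TypeScript helper. The `Finding` shape and the `securityStatus` name are assumptions for illustration; the real scanner output format does not appear in this diff.

```typescript
// Hypothetical sketch of the severity-to-gate rule: error findings fail the
// integrity check, warnings only downgrade the Security line to WARN.
type Severity = 'error' | 'warning' | 'info';

interface Finding {
  severity: Severity;
  ruleId: string;   // e.g. "SEC-INJ-002"
  location: string; // e.g. "src/auth/login.ts:42"
}

function securityStatus(findings: Finding[]): 'PASS' | 'WARN' | 'FAIL' {
  const errors = findings.filter((f) => f.severity === 'error').length;
  const warnings = findings.filter((f) => f.severity === 'warning').length;
  if (errors > 0) return 'FAIL';   // blocking, same as a test failure
  if (warnings > 0) return 'WARN'; // reported but non-blocking
  return 'PASS';                   // info-only findings do not change the status
}
```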
@@ -0,0 +1,231 @@
# Harness Perf

> Performance enforcement and benchmark management. Tier-based gates block commits and merges based on complexity, coupling, and runtime regression severity.

## When to Use

- After code changes to verify performance hasn't degraded
- On PRs to enforce performance budgets
- For periodic performance audits
- NOT for initial development (use harness-tdd for that)
- NOT for brainstorming performance improvements (use harness-brainstorming)

## Process

### Iron Law

**No merge with Tier 1 performance violations. No commit with cyclomatic complexity exceeding the error threshold.**

Tier 1 violations are non-negotiable blockers. If a Tier 1 violation is detected, execution halts and the violation must be resolved before any further progress. Do not attempt workarounds.

---

### Phase 1: ANALYZE — Structural and Coupling Checks

1. **Run structural checks.** Execute `harness check-perf --structural` to compute complexity metrics for all changed files:
   - Cyclomatic complexity per function
   - Nesting depth per function
   - File length (lines of code)
   - Parameter count per function

2. **Run coupling checks.** Execute `harness check-perf --coupling` to compute coupling metrics:
   - Fan-in and fan-out per module
   - Afferent and efferent coupling
   - Transitive dependency depth
   - Circular dependency detection

3. **Classify violations by tier:**
   - **Tier 1 (error, block commit):** Cyclomatic complexity > 15, circular dependencies, hotspot in top 5%
   - **Tier 2 (warning, block merge):** Complexity > 10, nesting > 4, fan-out > 10, size budget exceeded
   - **Tier 3 (info, no gate):** File length > 300, fan-in > 20, transitive depth > 30

4. **If Tier 1 violations found,** report them immediately and STOP. Do not proceed to benchmarks. The violations must be fixed first.

5. **If no violations found,** proceed to Phase 2.

---
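The tier rules in step 3 above reduce to a pure classification function. The sketch below uses an assumed `StructuralMetrics` shape (field names are invented, and the hotspot and fan-in rules are omitted for brevity); it is not the actual `harness check-perf` implementation.

```typescript
// Hypothetical sketch of the Phase 1 tier rules; field names are assumptions.
interface StructuralMetrics {
  cyclomaticComplexity: number;
  nestingDepth: number;
  fanOut: number;
  fileLength: number;
  inCircularDependency: boolean;
}

// Returns the worst (lowest-numbered) violated tier, or null if clean.
function classifyTier(m: StructuralMetrics): 1 | 2 | 3 | null {
  if (m.cyclomaticComplexity > 15 || m.inCircularDependency) return 1; // block commit
  if (m.cyclomaticComplexity > 10 || m.nestingDepth > 4 || m.fanOut > 10) return 2; // block merge
  if (m.fileLength > 300) return 3; // informational only
  return null;
}
```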

### Phase 2: BENCHMARK — Runtime Performance

This phase runs only when `.bench.ts` files exist in the project. If none are found, skip to Phase 3.

1. **Check for benchmark files.** Scan the project for `*.bench.ts` files. If none exist, skip this phase entirely.

2. **Verify clean working tree.** Run `git status --porcelain`. If there are uncommitted changes, STOP. Benchmarks on dirty trees produce unreliable results.

3. **Run benchmarks.** Execute `harness perf bench` to run all benchmark suites.

4. **Load baselines.** Read `.harness/perf/baselines.json` for previous benchmark results. If no baselines exist, treat this as a baseline-capture run.

5. **Compare results against baselines** using the `RegressionDetector`:
   - Calculate percentage change for each benchmark
   - Apply noise margin (default: 3%) before flagging regressions
   - Distinguish between critical-path and non-critical-path benchmarks

6. **Resolve critical paths** via `CriticalPathResolver`:
   - Check `@perf-critical` annotations in source files
   - Check graph fan-in data (functions called by many consumers)
   - Functions in the critical path set have stricter thresholds

7. **Flag regressions by tier:**
   - **Tier 1:** >5% regression on a critical path benchmark
   - **Tier 2:** >10% regression on a non-critical-path benchmark
   - **Tier 3:** >5% regression on a non-critical-path benchmark (within noise margin consideration)

8. **If this is a baseline-capture run,** report results without regression comparison. Recommend running `harness perf baselines update` to persist.

---
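Steps 5 through 7 above can be sketched as one function under the stated defaults (3% noise margin, 5% critical-path and 10% non-critical thresholds). The function name and signature are hypothetical, not the actual `RegressionDetector` API.

```typescript
// Hypothetical sketch of regression flagging per steps 5-7.
const NOISE_MARGIN_PCT = 3; // default noise margin from step 5

function regressionTier(
  baselineMs: number,
  currentMs: number,
  isCriticalPath: boolean,
): 1 | 2 | 3 | null {
  const deltaPct = ((currentMs - baselineMs) / baselineMs) * 100;
  if (deltaPct <= NOISE_MARGIN_PCT) return null;  // within noise, not a regression
  if (isCriticalPath && deltaPct > 5) return 1;   // Tier 1: critical path, >5%
  if (!isCriticalPath && deltaPct > 10) return 2; // Tier 2: elsewhere, >10%
  if (!isCriticalPath && deltaPct > 5) return 3;  // Tier 3: elsewhere, >5%
  return null;
}
```

With the document's own example numbers (parseDocument 4.2ms baseline, 4.8ms current, critical path), this yields Tier 1, matching the +14.3% example later in the skill.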

### Phase 3: REPORT — Generate Performance Report

1. **Format violations by tier.** Present Tier 1 violations first (most severe), then Tier 2, then Tier 3. Each violation entry includes:
   - File path and function name
   - Metric name and current value
   - Threshold that was exceeded
   - Tier classification and gate impact

2. **Show hotspot scores** for top functions if knowledge graph data is available:
   - Query the graph for functions with high churn + high fan-in
   - Rank by composite hotspot score
   - Flag any hotspots that also have performance violations

3. **Show benchmark regression summary** if benchmarks ran:
   - Table of benchmark name, baseline, current, delta percentage, tier
   - Highlight critical-path benchmarks with a marker
   - Show noise margin and whether the regression exceeds it

4. **Recommend specific actions** for each Tier 1 and Tier 2 violation:
   - For high complexity: suggest extract-method or strategy pattern refactoring
   - For high coupling: suggest interface extraction or dependency inversion
   - For benchmark regressions: suggest profiling the specific code path
   - For size budget violations: suggest module decomposition

5. **Output the report** in structured markdown format suitable for PR comments or CI output.

---

### Phase 4: ENFORCE — Apply Gate Decisions

1. **Tier 1 violations present** — FAIL. Block commit and merge. List all Tier 1 violations with their locations and values. The developer must fix these before proceeding.

2. **Tier 2 violations present, no Tier 1** — WARN. Allow commit but block merge until addressed. List all Tier 2 violations. These must be resolved before the PR can be merged.

3. **Only Tier 3 or no violations** — PASS. Proceed normally. Log Tier 3 violations as informational notes.

4. **Record gate decision** in `.harness/state.json` under a `perfGate` key:

   ```json
   {
     "perfGate": {
       "result": "pass|warn|fail",
       "tier1Count": 0,
       "tier2Count": 0,
       "tier3Count": 0,
       "timestamp": "ISO-8601"
     }
   }
   ```

5. **Exit with appropriate code:** 0 for pass, 1 for fail, 0 for warn (with warning output).

---
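The enforcement rules above (steps 1 through 3, plus the exit codes in step 5) reduce to a small decision function. A sketch with an invented `GateDecision` shape; the field names loosely mirror the `perfGate` state record but are assumptions, not the harness implementation.

```typescript
// Hypothetical sketch of the Phase 4 gate decision and exit-code mapping.
interface GateDecision {
  result: 'pass' | 'warn' | 'fail';
  exitCode: 0 | 1;
}

function decideGate(tier1Count: number, tier2Count: number): GateDecision {
  if (tier1Count > 0) return { result: 'fail', exitCode: 1 }; // block commit and merge
  if (tier2Count > 0) return { result: 'warn', exitCode: 0 }; // commit ok, merge blocked
  return { result: 'pass', exitCode: 0 };                     // Tier 3 only, or clean
}
```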

## Harness Integration

- **`harness check-perf`** — Primary command for all performance checks. Runs structural and coupling analysis.
- **`harness check-perf --structural`** — Run only structural complexity checks.
- **`harness check-perf --coupling`** — Run only coupling analysis.
- **`harness perf bench`** — Run benchmarks only. Requires clean working tree.
- **`harness perf baselines show`** — View current benchmark baselines.
- **`harness perf baselines update`** — Persist current benchmark results as new baselines.
- **`harness perf critical-paths`** — View the current critical path set and how it was determined.
- **`harness validate`** — Run after enforcement to verify overall project health.
- **`harness graph scan`** — Refresh knowledge graph for accurate hotspot scoring.

## Tier Classification

| Tier | Severity | Gate         | Examples                                                                                            |
| ---- | -------- | ------------ | --------------------------------------------------------------------------------------------------- |
| 1    | error    | Block commit | Cyclomatic complexity > 15, >5% regression on critical path, hotspot in top 5%, circular dependency |
| 2    | warning  | Block merge  | Complexity > 10, nesting > 4, >10% regression elsewhere, fan-out > 10, size budget exceeded         |
| 3    | info     | None         | File length > 300 lines, fan-in > 20, transitive depth > 30, >5% non-critical regression            |

## Success Criteria

- All Tier 1 violations are resolved before proceeding
- Performance report follows structured format with tier classification
- Benchmark regressions are compared against noise margin before flagging
- Gate decision is recorded in state
- `harness validate` passes after enforcement

## Examples

### Example: PR with High Complexity Function

```
Phase 1: ANALYZE
harness check-perf --structural
Result: processOrderBatch() in src/orders/processor.ts has cyclomatic complexity 18 (Tier 1, threshold: 15)

Phase 2: BENCHMARK — skipped (Tier 1 violation found)

Phase 3: REPORT
TIER 1 VIOLATIONS (1):
- src/orders/processor.ts:processOrderBatch — complexity 18 > 15
Recommendation: Extract validation and transformation into separate functions

Phase 4: ENFORCE
Result: FAIL — 1 Tier 1 violation. Commit blocked.
```

### Example: Benchmark Regression on Critical Path

```
Phase 1: ANALYZE — no structural violations

Phase 2: BENCHMARK
harness perf bench
Baseline: parseDocument 4.2ms, current: 4.8ms (+14.3%)
parseDocument is @perf-critical — Tier 1 threshold applies (>5%)

Phase 3: REPORT
TIER 1 VIOLATIONS (1):
- parseDocument: 14.3% regression on critical path (threshold: 5%)
Recommendation: Profile parseDocument to identify the regression source

Phase 4: ENFORCE
Result: FAIL — 1 Tier 1 violation. Merge blocked.
```

### Example: Clean PR with Minor Warnings

```
Phase 1: ANALYZE
harness check-perf --structural --coupling
Result: src/utils/formatter.ts has 320 lines (Tier 3, threshold: 300)

Phase 2: BENCHMARK
harness perf bench — all within noise margin

Phase 3: REPORT
TIER 3 INFO (1):
- src/utils/formatter.ts: 320 lines > 300 line threshold
No Tier 1 or Tier 2 violations.

Phase 4: ENFORCE
Result: PASS — no blocking violations.
```

## Gates

- **No ignoring Tier 1 violations.** They must be fixed or the threshold must be reconfigured (with documented justification).
- **No running benchmarks with dirty working tree.** Uncommitted changes invalidate benchmark results.
- **No updating baselines without running benchmarks.** Baselines must come from fresh runs against committed code.
- **No suppressing violations without documentation.** If a threshold is relaxed, the rationale must be documented in the project configuration.

## Escalation

- **When Tier 1 violations cannot be fixed within the current task:** Propose refactoring the function into smaller units, or raising the threshold with a documented justification. Do not silently skip the violation.
- **When benchmark results are noisy or inconsistent:** Increase warmup iterations, pin the runtime environment, or run benchmarks in isolation. Report the noise level so the developer can make an informed decision.
- **When critical path detection seems wrong:** Check `@perf-critical` annotations in source files and verify graph fan-in thresholds. The critical path set can be overridden in `.harness/perf/critical-paths.json`.
- **When a violation is a false positive:** Document it with a `// perf-ignore: <reason>` comment and add the exception to `.harness/perf/exceptions.json`.
@@ -0,0 +1,47 @@
name: harness-perf
version: "1.0.0"
description: Performance enforcement and benchmark management
cognitive_mode: meticulous-verifier
triggers:
  - manual
  - on_pr
platforms:
  - claude-code
  - gemini-cli
tools:
  - Bash
  - Read
  - Write
  - Edit
  - Glob
  - Grep
cli:
  command: harness skill run harness-perf
  args:
    - name: path
      description: Project root path
      required: false
mcp:
  tool: run_skill
  input:
    skill: harness-perf
    path: string
type: rigid
phases:
  - name: analyze
    description: Run structural complexity and coupling checks
    required: true
  - name: benchmark
    description: Run benchmarks and detect regressions
    required: false
  - name: report
    description: Generate perf report with violations and recommendations
    required: true
  - name: enforce
    description: Apply tier-based gate decisions
    required: true
state:
  persistent: false
  files: []
depends_on:
  - harness-verify
@@ -0,0 +1,236 @@
# Harness Perf TDD

> Red-Green-Refactor with performance assertions. Every feature gets a correctness test AND a benchmark. No optimization without measurement.

## When to Use

- Implementing performance-critical features
- When the spec includes performance requirements (e.g., "must respond in < 100ms")
- When modifying `@perf-critical` annotated code
- When adding hot-path logic (parsers, serializers, query resolvers, middleware)
- NOT for non-performance-sensitive code (use harness-tdd instead)
- NOT for refactoring existing code that already has benchmarks (use harness-refactoring + harness-perf)

## Process

### Iron Law

**No production code exists without both a failing test AND a failing benchmark that demanded its creation.**

If you find yourself writing production code before both the test and the benchmark exist, STOP. Write the test. Write the benchmark. Then implement.

---

### Phase 1: RED — Write Failing Test + Benchmark

1. **Write the correctness test** following the same process as harness-tdd Phase 1 (RED):
   - Identify the smallest behavior to test
   - Write ONE minimal test with a clear assertion
   - Follow the project's test conventions

2. **Write a `.bench.ts` benchmark file** alongside the test file:
   - Co-locate with source: `handler.ts` -> `handler.bench.ts`
   - Use Vitest bench syntax for benchmark definitions
   - Set a performance assertion if the spec includes one

   ```typescript
   import { bench, describe } from 'vitest';
   import { processData } from './handler';

   describe('processData benchmarks', () => {
     bench('processData with small input', () => {
       processData(smallInput);
     });

     bench('processData with large input', () => {
       processData(largeInput);
     });
   });
   ```

3. **Run the test** — observe failure. The function is not implemented yet, so the test should fail with "not defined" or "not a function."

4. **Run the benchmark** — observe failure or no baseline. This establishes that the benchmark exists and will track performance once the implementation lands.

---
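Vitest's `bench` reports timings but does not itself assert against a budget, so a spec requirement such as "must respond in < 100ms" needs a separate check. One possible sketch (an assumption about approach, not harness's prescribed mechanism) is a tiny timing helper callable from an ordinary test; the `meetsBudget` name and the workload are invented for illustration.

```typescript
// Sketch: turning a spec budget ("must respond in < 100ms") into a boolean
// check usable from any test framework. Single-shot timing only; a real
// assertion would average over warmed-up iterations to reduce noise.
function meetsBudget(fn: () => void, budgetMs: number): boolean {
  const start = performance.now();
  fn();
  const elapsedMs = performance.now() - start;
  return elapsedMs < budgetMs;
}

// Example: a trivial workload comfortably inside a 100ms budget.
const ok = meetsBudget(() => {
  let sum = 0;
  for (let i = 0; i < 1_000; i++) sum += i;
}, 100);
```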

### Phase 2: GREEN — Pass Test and Benchmark

1. **Write the minimum implementation** to make the correctness test pass. Do not optimize yet. The goal is correctness first.

2. **Run the test** — observe pass. If it fails, fix the implementation until it passes.

3. **Run the benchmark** — capture initial results. This is the first measurement. Note:
   - If a performance assertion exists in the spec, verify it passes
   - If no assertion exists, record the result as a baseline reference
   - Do not optimize at this stage unless the assertion fails

4. **If the performance assertion fails,** you have two options:
   - The implementation approach is fundamentally wrong (e.g., O(n^2) when O(n) is needed) — revise the algorithm
   - The assertion is too aggressive for a first pass — note it and defer to REFACTOR phase

---

### Phase 3: REFACTOR — Optimize While Green

This phase is optional. Enter it when:

- The benchmark shows room for improvement against the performance requirement
- Profiling reveals an obvious bottleneck
- The code can be simplified while maintaining or improving performance

1. **Profile the implementation** if the benchmark result is far from the requirement. Use the benchmark output to identify the bottleneck.

2. **Refactor for performance** — consider:
   - Algorithm improvements (sort, search, data structure choice)
   - Caching or memoization for repeated computations
   - Reducing allocations (object pooling, buffer reuse)
   - Eliminating unnecessary work (early returns, lazy evaluation)

3. **After each change,** run both checks:
   - **Test:** Still passing? If not, the refactor broke correctness. Revert.
   - **Benchmark:** Improved? If not, the refactor was not effective. Consider reverting.

4. **Stop when** the benchmark meets the performance requirement, or when further optimization yields diminishing returns (< 1% improvement per change).

5. **Do not gold-plate.** If the requirement is "< 100ms" and you are at 40ms, stop. Move on.

---
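One of the step 2 techniques, memoization for repeated computations, sketched minimally. This assumes a pure function with one JSON-serializable argument that never returns `undefined`; after wrapping the hot function, re-run the benchmark to confirm the change actually helped (step 3).

```typescript
// Sketch of memoization: cache results of a pure function by serialized argument.
// Assumes fn is pure, takes one JSON-serializable argument, never returns undefined.
function memoize<A, R>(fn: (arg: A) => R): (arg: A) => R {
  const cache = new Map<string, R>();
  return (arg: A): R => {
    const key = JSON.stringify(arg);
    const hit = cache.get(key);
    if (hit !== undefined) return hit; // skip recomputation on repeat inputs
    const result = fn(arg);
    cache.set(key, result);
    return result;
  };
}

// Usage: count underlying calls to show the second lookup is served from cache.
let calls = 0;
const slowSquare = (n: number) => { calls++; return n * n; };
const fastSquare = memoize(slowSquare);
fastSquare(7);
fastSquare(7); // second call hits the cache; `calls` stays at 1
```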

### Phase 4: VALIDATE — Harness Checks

1. **Run `harness check-perf`** to verify no Tier 1 or Tier 2 violations were introduced by the implementation:
   - Cyclomatic complexity within thresholds
   - Coupling metrics acceptable
   - No benchmark regressions in other modules

2. **Run `harness validate`** to verify overall project health:
   - All tests pass
   - Linter clean
   - Type checks pass

3. **Update baselines** if this is a new benchmark:

   ```bash
   harness perf baselines update
   ```

   This persists the current benchmark results so future runs can detect regressions.

4. **Commit with a descriptive message** that mentions both the feature and its performance characteristics:

   ```
   feat(parser): add streaming JSON parser (<50ms for 1MB payloads)
   ```

---

## Benchmark File Convention

Benchmark files are co-located with their source files, using the `.bench.ts` extension:

| Source File                   | Benchmark File                      |
| ----------------------------- | ----------------------------------- |
| `src/parser/handler.ts`       | `src/parser/handler.bench.ts`       |
| `src/api/resolver.ts`         | `src/api/resolver.bench.ts`         |
| `packages/core/src/engine.ts` | `packages/core/src/engine.bench.ts` |

Each benchmark file should:

- Import only from the module under test
- Define benchmarks in a `describe` block named after the module
- Include both small-input and large-input cases when applicable
- Use realistic data (not empty objects or trivial inputs)

---
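The last point above, realistic data, can be sketched as a deterministic fixture builder: varied values rather than empty objects, and the same data on every run so benchmark comparisons stay stable. The `OrderRecord` shape and `makeFixture` helper are invented for illustration.

```typescript
// Sketch: building a realistic, deterministic large-input fixture for a
// .bench.ts file instead of benchmarking against empty objects.
interface OrderRecord {
  id: string;
  amountCents: number;
  tags: string[];
}

function makeFixture(count: number): OrderRecord[] {
  return Array.from({ length: count }, (_, i) => ({
    id: `order-${i}`,
    amountCents: (i * 37) % 100_000, // varied but reproducible values
    tags: i % 3 === 0 ? ['priority', 'bulk'] : ['standard'],
  }));
}

const largeInput = makeFixture(10_000); // reuse across bench cases for stable comparisons
```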

## Harness Integration

- **`harness check-perf`** — Run after implementation to check for violations
- **`harness perf bench`** — Run benchmarks in isolation
- **`harness perf baselines update`** — Persist benchmark results as new baselines
- **`harness validate`** — Full project health check
- **`harness perf critical-paths`** — View critical path set to understand which benchmarks have stricter thresholds

## Success Criteria

- Every new function has both a test file (`.test.ts`) and a bench file (`.bench.ts`)
- Benchmarks run without errors
- No Tier 1 performance violations after implementation
- Baselines are updated for new benchmarks
- Commit message includes performance context when relevant

## Examples

### Example: Implementing a Performance-Critical Parser

**Phase 1: RED**

```typescript
// src/parser/json-stream.test.ts
it('parses 1MB JSON in under 50ms', () => {
  const result = parseStream(largeMbPayload);
  expect(result).toEqual(expectedOutput);
});

// src/parser/json-stream.bench.ts
bench('parseStream 1MB', () => {
  parseStream(largeMbPayload);
});
```

Run test: FAIL (parseStream not defined). Run benchmark: FAIL (no implementation).

**Phase 2: GREEN**

```typescript
// src/parser/json-stream.ts
export function parseStream(input: string): ParsedResult {
  return JSON.parse(input); // simplest correct implementation
}
```

Run test: PASS. Run benchmark: 38ms average (meets <50ms requirement).

**Phase 3: REFACTOR** — skipped (38ms already meets 50ms target).

**Phase 4: VALIDATE**

```
harness check-perf — no violations
harness validate — passes
harness perf baselines update — baseline saved
git commit -m "feat(parser): add streaming JSON parser (<50ms for 1MB payloads)"
```

### Example: Optimizing an Existing Hot Path

**Phase 1: RED** — test and benchmark already exist from initial implementation.

**Phase 3: REFACTOR**

```
Before: resolveImports 12ms (requirement: <5ms)
Change: switch from recursive descent to iterative with stack
After: resolveImports 3.8ms
Test: still passing
```

**Phase 4: VALIDATE**

```
harness check-perf — complexity reduced from 12 to 8 (improvement)
harness perf baselines update — new baseline saved
```

## Gates

- **No code before test AND benchmark.** Both must exist before implementation begins.
- **No optimization without measurement.** Run the benchmark before and after refactoring. Gut feelings are not measurements.
- **No skipping VALIDATE.** `harness check-perf` and `harness validate` must pass after every cycle.
- **No committing without updated baselines.** New benchmarks must have baselines persisted.

## Escalation

- **When the performance requirement cannot be met:** Report the best achieved result and propose either relaxing the requirement or redesigning the approach. Include benchmark data.
- **When benchmarks are flaky:** Increase iteration count, add warmup, or isolate the benchmark from I/O. Report the variance so the team can decide on an acceptable noise margin.
- **When the test and benchmark have conflicting needs:** Correctness always wins. If a correct implementation cannot meet the performance requirement, escalate to the team for a design decision.
@@ -0,0 +1,47 @@
name: harness-perf-tdd
version: "1.0.0"
description: Performance-aware TDD with benchmark assertions in the red-green-refactor cycle
cognitive_mode: meticulous-implementer
triggers:
  - manual
platforms:
  - claude-code
  - gemini-cli
tools:
  - Bash
  - Read
  - Write
  - Edit
  - Glob
  - Grep
cli:
  command: harness skill run harness-perf-tdd
  args:
    - name: path
      description: Project root path
      required: false
mcp:
  tool: run_skill
  input:
    skill: harness-perf-tdd
    path: string
type: rigid
phases:
  - name: red
    description: Write failing test and benchmark assertion
    required: true
  - name: green
    description: Implement to pass test and benchmark
    required: true
  - name: refactor
    description: Optimize while keeping both green
    required: false
  - name: validate
    description: Run harness check-perf and harness validate
    required: true
state:
  persistent: false
  files: []
depends_on:
  - harness-tdd
  - harness-perf
@@ -107,7 +107,29 @@ AI Review: SKIPPED (docs/config only)
 
 If any staged file contains code changes, proceed to Phase 3.
 
-### Phase 3:
+### Phase 3: Security Scan
+
+Run the built-in security scanner against staged files. This is a mechanical check — no AI judgment involved.
+
+```bash
+# Get list of staged source files
+git diff --cached --name-only --diff-filter=d | grep -E '\.(ts|tsx|js|jsx|go|py)$'
+```
+
+Use the `run_security_scan` MCP tool or invoke the scanner on the staged files. Report any findings:
+
+- **Error findings (blocking):** Hardcoded secrets, eval/injection, weak crypto — these block the commit just like lint failures.
+- **Warning/info findings (advisory):** CORS wildcards, HTTP URLs, disabled TLS — reported but do not block.
+
+Include security scan results in the report output:
+
+```
+Security Scan: [PASS/WARN/FAIL] (N errors, N warnings)
+```
+
+If no source files are staged, skip the security scan.
+
+### Phase 4: AI Review (Lightweight)
 
 Perform a focused, lightweight review of staged changes. This is NOT a full code review — it catches obvious issues only.
 
@@ -122,7 +144,7 @@ git diff --cached
 Review the staged diff for these high-signal issues only:
 
 - **Obvious bugs:** null dereference, infinite loops, off-by-one errors, resource leaks
-- **Security issues:** hardcoded secrets, SQL injection, path traversal, unvalidated input
+- **Security issues:** hardcoded secrets, SQL injection, path traversal, unvalidated input (complements the mechanical scan with semantic analysis — e.g., tracing user input across function boundaries)
 - **Broken imports:** references to files/modules that do not exist
 - **Debug artifacts:** console.log, debugger statements, TODO/FIXME without issue reference
 - **Type mismatches:** function called with wrong argument types (if visible in diff)
@@ -145,6 +167,7 @@ Mechanical Checks:
 - Lint: PASS
 - Types: PASS
 - Tests: PASS (12/12)
+- Security Scan: PASS (0 errors, 0 warnings)
 
 AI Review: PASS (no issues found)
 ```
@@ -158,6 +181,8 @@ Mechanical Checks:
 - Lint: PASS
 - Types: PASS
 - Tests: PASS (12/12)
+- Security Scan: WARN (0 errors, 1 warning)
+  - [SEC-NET-001] src/cors.ts:5 — CORS wildcard origin
 
 AI Review: 2 observations
 1. [file:line] Possible null dereference — `user.email` accessed without null check after `findUser()` which can return null.