@pennyfarthing/benchmark 10.2.0
- package/commands/benchmark-control.md +69 -0
- package/commands/benchmark.md +485 -0
- package/commands/job-fair.md +102 -0
- package/commands/solo.md +447 -0
- package/dist/benchmark-integration.d.ts +182 -0
- package/dist/benchmark-integration.d.ts.map +1 -0
- package/dist/benchmark-integration.js +710 -0
- package/dist/benchmark-integration.js.map +1 -0
- package/dist/benchmark-integration.test.d.ts +6 -0
- package/dist/benchmark-integration.test.d.ts.map +1 -0
- package/dist/benchmark-integration.test.js +41 -0
- package/dist/benchmark-integration.test.js.map +1 -0
- package/dist/index.d.ts +3 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +5 -0
- package/dist/index.js.map +1 -0
- package/dist/job-fair-aggregator.d.ts +150 -0
- package/dist/job-fair-aggregator.d.ts.map +1 -0
- package/dist/job-fair-aggregator.js +547 -0
- package/dist/job-fair-aggregator.js.map +1 -0
- package/dist/job-fair-aggregator.test.d.ts +6 -0
- package/dist/job-fair-aggregator.test.d.ts.map +1 -0
- package/dist/job-fair-aggregator.test.js +35 -0
- package/dist/job-fair-aggregator.test.js.map +1 -0
- package/dist/package-exports.test.d.ts +13 -0
- package/dist/package-exports.test.d.ts.map +1 -0
- package/dist/package-exports.test.js +192 -0
- package/dist/package-exports.test.js.map +1 -0
- package/docs/BENCHMARK-METHODOLOGY.md +105 -0
- package/docs/BENCHMARKING.md +311 -0
- package/docs/OCEAN-BENCHMARKING.md +210 -0
- package/docs/benchmarks-guide.md +62 -0
- package/package.json +66 -0
- package/scenarios/README.md +145 -0
- package/scenarios/architecture/database-selection.yaml +119 -0
- package/scenarios/architecture/legacy-modernization.yaml +153 -0
- package/scenarios/architecture/scaling-decision.yaml +88 -0
- package/scenarios/code-review/graphql-api-review.yaml +714 -0
- package/scenarios/code-review/order-service.yaml +622 -0
- package/scenarios/code-review/react-auth-component.yaml +569 -0
- package/scenarios/code-review/security-review.yaml +145 -0
- package/scenarios/code-review/terraform-infrastructure.yaml +582 -0
- package/scenarios/debug/buggy-user-service.yaml +541 -0
- package/scenarios/debug/null-pointer.yaml +130 -0
- package/scenarios/debugging/async-control-flow.yaml +161 -0
- package/scenarios/debugging/auth-bypass.yaml +197 -0
- package/scenarios/debugging/error-handling.yaml +178 -0
- package/scenarios/debugging/input-validation.yaml +157 -0
- package/scenarios/debugging/null-check-missing.yaml +139 -0
- package/scenarios/debugging/off-by-one-loop.yaml +132 -0
- package/scenarios/debugging/race-condition.yaml +180 -0
- package/scenarios/debugging/resource-leak.yaml +166 -0
- package/scenarios/debugging/simple-logic-error.yaml +115 -0
- package/scenarios/debugging/sql-injection.yaml +163 -0
- package/scenarios/dev/event-processor-tdd.yaml +764 -0
- package/scenarios/dev/migration-disaster.yaml +415 -0
- package/scenarios/dev/race-condition-cache.yaml +546 -0
- package/scenarios/dev/tdd-shopping-cart.yaml +681 -0
- package/scenarios/schema.yaml +639 -0
- package/scenarios/sm/dependency-deadlock.yaml +414 -0
- package/scenarios/sm/executive-pet-project.yaml +336 -0
- package/scenarios/sm/layoff-planning.yaml +356 -0
- package/scenarios/sm/sprint-planning-conflict.yaml +303 -0
- package/scenarios/sm/story-breakdown.yaml +240 -0
- package/scenarios/sm/three-sprint-failure.yaml +397 -0
- package/scenarios/swe-bench/README.md +57 -0
- package/scenarios/swe-bench/astropy-12907.yaml +128 -0
- package/scenarios/swe-bench/astropy-13398.yaml +177 -0
- package/scenarios/swe-bench/astropy-14309.yaml +180 -0
- package/scenarios/swe-bench/django-10097.yaml +106 -0
- package/scenarios/swe-bench/django-10554.yaml +140 -0
- package/scenarios/swe-bench/django-10973.yaml +93 -0
- package/scenarios/swe-bench/flask-5014-reviewer.yaml +145 -0
- package/scenarios/swe-bench/flask-5014-tea.yaml +123 -0
- package/scenarios/swe-bench/flask-5014.yaml +91 -0
- package/scenarios/swe-bench/import-swebench.py +246 -0
- package/scenarios/swe-bench/matplotlib-13989.yaml +139 -0
- package/scenarios/swe-bench/matplotlib-14623.yaml +127 -0
- package/scenarios/swe-bench/requests-1142-reviewer.yaml +144 -0
- package/scenarios/swe-bench/requests-1142-tea.yaml +135 -0
- package/scenarios/swe-bench/requests-1142.yaml +100 -0
- package/scenarios/swe-bench/requests-2931.yaml +98 -0
- package/scenarios/swe-bench/seaborn-3069.yaml +102 -0
- package/scenarios/swe-bench/sphinx-7590.yaml +108 -0
- package/scenarios/swe-bench/xarray-3993.yaml +104 -0
- package/scenarios/swe-bench/xarray-6992.yaml +136 -0
- package/scenarios/tea/checkout-component-tests.yaml +596 -0
- package/scenarios/tea/cli-tool-tests.yaml +561 -0
- package/scenarios/tea/microservice-integration-tests.yaml +520 -0
- package/scenarios/tea/payment-processor-tests.yaml +550 -0
- package/scripts/aggregate-benchmark-stats.js +315 -0
- package/scripts/aggregate-benchmark-stats.sh +8 -0
- package/scripts/benchmark-runner.js +392 -0
- package/scripts/benchmark-runner.sh +8 -0
- package/scripts/consolidate-job-fair.sh +107 -0
- package/scripts/convert-jobfair-to-benchmarks.sh +230 -0
- package/scripts/job-fair-batch.sh +116 -0
- package/scripts/job-fair-progress.sh +35 -0
- package/scripts/job-fair-runner.sh +278 -0
- package/scripts/job-fair-status.sh +80 -0
- package/scripts/job-fair-watcher-v2.sh +38 -0
- package/scripts/job-fair-watcher.sh +50 -0
- package/scripts/parallel-benchmark.sh +140 -0
- package/scripts/solo-runner.sh +344 -0
- package/scripts/test/ensure-swebench-data.sh +59 -0
- package/scripts/test/ground-truth-judge.py +220 -0
- package/scripts/test/swebench-judge.py +374 -0
- package/scripts/test/test-cache.sh +165 -0
- package/scripts/test/test-setup.sh +337 -0
- package/scripts/theme/compute-theme-tiers.sh +13 -0
- package/scripts/theme/compute_theme_tiers.py +402 -0
- package/scripts/theme/update-theme-tiers.sh +97 -0
- package/skills/finalize-run/SKILL.md +261 -0
- package/skills/judge/SKILL.md +644 -0
- package/skills/persona-benchmark/SKILL.md +187 -0

@@ -0,0 +1,187 @@
---
name: persona-benchmark
description: Run benchmarks to compare persona effectiveness across themes. Use when testing which personas perform best on code review, test writing, or architecture tasks, or when running comparative analysis across themes.
---

# Persona Benchmark Skill

Run benchmarks to compare persona effectiveness.

<run>
/persona-benchmark <test-case-id> <persona>
/persona-benchmark <test-case-id> <persona> [--analyze] [--suite]
</run>

<output>
Benchmark results saved to `.claude/benchmarks/results/{timestamp}-{persona}-{test-case-id}.yaml` with quantitative and qualitative metrics, or an analysis summary when using `--analyze`.
</output>

## Usage

```
/persona-benchmark <test-case-id> <persona>
/persona-benchmark cr-001 discworld
/persona-benchmark tw-001 literary-classics
/persona-benchmark --suite    # Run all tests, all personas
/persona-benchmark --analyze  # Analyze collected results
```

## Benchmark Execution Protocol

### Step 1: Load Test Case

Read the test case from `.claude/benchmarks/test-cases/{category}/{id}.yaml`

Extract (see the loading sketch below):
- `instructions` - What to give the agent
- `code` - The code/problem to analyze
- `known_issues` or `known_edge_cases` or `known_considerations` - DO NOT reveal to agent
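
Step 1 reduces to reading one YAML file and splitting its fields into "shown to the agent" and "held back for scoring". A minimal TypeScript sketch, assuming a `js-yaml`-style parser and a hypothetical `TestCase` shape inferred from the field names above (the real files may carry more fields):

```typescript
// Minimal sketch of Step 1. The TestCase shape is hypothetical,
// inferred from the field names listed above.
import { readFileSync } from "node:fs";
import { load } from "js-yaml";

interface TestCase {
  instructions: string;              // shown to the agent
  code: string;                      // shown to the agent
  known_issues?: unknown[];          // held back for scoring
  known_edge_cases?: unknown[];      // held back for scoring
  known_considerations?: unknown[];  // held back for scoring
}

function loadTestCase(category: string, id: string): TestCase {
  const path = `.claude/benchmarks/test-cases/${category}/${id}.yaml`;
  return load(readFileSync(path, "utf8")) as TestCase;
}
```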

### Step 2: Configure Persona

Temporarily set persona in `.claude/persona-config.yaml`:
```yaml
theme: {persona}
```
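
Since the swap is meant to be temporary, the previous config should be restored even when the run fails. A sketch, assuming the config file holds only the `theme` key (if the real file carries more settings, rewrite only the `theme` line):

```typescript
// Step 2 sketch: swap the theme in, run the task, and always restore
// the original config, even when the benchmark run throws.
import { readFileSync, writeFileSync } from "node:fs";

const CONFIG = ".claude/persona-config.yaml";

async function withPersona(persona: string, run: () => Promise<void>): Promise<void> {
  const original = readFileSync(CONFIG, "utf8");
  writeFileSync(CONFIG, `theme: ${persona}\n`);
  try {
    await run();
  } finally {
    writeFileSync(CONFIG, original);
  }
}
```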

### Step 3: Execute Task

Invoke the appropriate agent:
- `code-review` → `/reviewer`
- `test-writing` → `/tea`
- `architecture` → `/architect`

Provide ONLY:
- The instructions
- The code/problem

Do NOT reveal the known issues list.
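
The routing is a plain lookup; a sketch, assuming the test-case categories use these exact keys:

```typescript
// Step 3 routing table; the command strings mirror the list above.
const AGENT_FOR_CATEGORY: Record<string, string> = {
  "code-review": "/reviewer",
  "test-writing": "/tea",
  "architecture": "/architect",
};
```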

### Step 4: Collect Results

After the agent completes, score the output:

**Quantitative Scoring:**
```yaml
# For code review:
issues_found: [list of issues detected]
issues_matched: [map to known_issues ids]
detection_rate: issues_matched / total_known_issues
false_positives: issues not in known list

# For test writing:
edge_cases_found: [list of edge cases covered]
edge_cases_matched: [map to known_edge_cases ids]
coverage_rate: edge_cases_matched / total_known_edge_cases

# For architecture:
considerations_found: [list of considerations mentioned]
considerations_matched: [map to known_considerations ids]
completeness_rate: considerations_matched / total_known_considerations
```
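
All three variants are the same set arithmetic under different names. A TypeScript sketch of the code-review case, assuming the scorer has already mapped each finding to a known-issue id (or to none):

```typescript
// Quantitative scoring for code review: findings that map to a known
// issue raise the detection rate; unmapped findings are false positives.
function scoreReview(findings: string[], matchedIds: string[], knownIds: string[]) {
  const matched = [...new Set(matchedIds)].filter((id) => knownIds.includes(id));
  return {
    issues_found: findings,
    issues_matched: matched,
    detection_rate: matched.length / knownIds.length,
    false_positives: findings.length - matched.length,
  };
}
```

The test-writing and architecture variants swap in `edge_cases_*`/`coverage_rate` and `considerations_*`/`completeness_rate` respectively.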

**Qualitative Scoring (1-5):**
- `persona_consistency`: Did the agent stay in character?
- `explanation_quality`: How well did it explain its findings?
- `actionability`: How usable is the output?
- `engagement`: How enjoyable was the interaction?

### Step 5: Save Results

Write to `.claude/benchmarks/results/{timestamp}-{persona}-{test-case-id}.yaml`:

```yaml
benchmark:
  test_case: cr-001
  persona: discworld
  agent: reviewer
  character: Granny Weatherwax
  timestamp: 2024-01-15T10:30:00Z

quantitative:
  items_found: 8
  items_expected: 14
  items_matched:
    - SQL_INJECTION_1
    - SQL_INJECTION_2
    - PLAINTEXT_PASSWORD
    - PASSWORD_EXPOSURE_1
    - PASSWORD_EXPOSURE_2
    - NO_AUTH_CHECK
    - ASYNC_DELETE_NO_TX
    - ROWS_NOT_CLOSED
  detection_rate: 0.57
  false_positives: 1
  weighted_score: 15.5
  max_weighted_score: 22.5
  weighted_rate: 0.69

qualitative:
  persona_consistency: 5
  explanation_quality: 4
  actionability: 4
  engagement: 5

notes: |
  Found both SQL injections immediately with strong language.
  Missed the error handling issues.
  Very much in character - "I aten't reviewing code that's already dead."

raw_output: |
  [Full agent output preserved here]
```
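
The derived fields in the example are internally consistent and worth re-checking when writing a result file (the weighting scheme itself is assumed to come from per-issue weights in the test case):

```typescript
// Sanity-check the derived fields from the example above:
// 8 / 14 rounds to 0.57 and 15.5 / 22.5 rounds to 0.69.
const round2 = (x: number) => Math.round(x * 100) / 100;

console.log(round2(8 / 14));      // 0.57 -> detection_rate
console.log(round2(15.5 / 22.5)); // 0.69 -> weighted_rate
```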

## Analysis Mode

When run with `--analyze`:

1. Load all results from `.claude/benchmarks/results/`

2. Aggregate by persona (see the sketch after this list):
   ```
   | Persona           | Detection Rate | False Pos | Persona Score | Engagement |
   |-------------------|----------------|-----------|---------------|------------|
   | discworld         | 0.71           | 1.2       | 4.8           | 4.9        |
   | star-trek         | 0.68           | 0.8       | 4.5           | 4.2        |
   | literary-classics | 0.73           | 1.5       | 4.2           | 4.0        |
   | minimalist        | 0.65           | 0.5       | N/A           | 3.2        |
   ```

3. Aggregate by test category:
   ```
   | Category     | Best Persona      | Avg Detection |
   |--------------|-------------------|---------------|
   | code-review  | literary-classics | 0.71          |
   | test-writing | discworld         | 0.68          |
   | architecture | star-trek         | 0.75          |
   ```

4. Statistical significance:
   - Calculate the standard deviation
   - Note whether differences are significant

5. Qualitative patterns:
   - Which personas stay in character best?
   - Which provide the most actionable output?
   - User enjoyment patterns
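
A sketch of steps 2 and 4 combined, assuming each result file has already been parsed down to a `{ persona, detection_rate }` record:

```typescript
// Steps 2 and 4 sketch: group results by persona, then report the
// mean detection rate and its (population) standard deviation.
interface Result { persona: string; detection_rate: number; }

function aggregate(results: Result[]) {
  const byPersona = new Map<string, number[]>();
  for (const r of results) {
    const rates = byPersona.get(r.persona) ?? [];
    rates.push(r.detection_rate);
    byPersona.set(r.persona, rates);
  }
  return [...byPersona].map(([persona, rates]) => {
    const mean = rates.reduce((a, b) => a + b, 0) / rates.length;
    const variance = rates.reduce((a, b) => a + (b - mean) ** 2, 0) / rates.length;
    return { persona, mean, stddev: Math.sqrt(variance) };
  });
}
```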

## Running a Full Suite

```bash
# This will take a while - runs each test case with each persona
/persona-benchmark --suite
```

Executes (sketched below):
- All test cases in `test-cases/`
- With each persona: discworld, star-trek, literary-classics, minimalist
- Saves individual results
- Produces a summary comparison
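
The suite is the single-run protocol in a nested loop. A sketch, with `listTestCases` and `runBenchmark` declared as hypothetical helpers standing in for Steps 1-5 above:

```typescript
// Hypothetical helpers wrapping the single-run protocol above.
declare function listTestCases(): Promise<string[]>;
declare function runBenchmark(testCaseId: string, persona: string): Promise<void>;

const PERSONAS = ["discworld", "star-trek", "literary-classics", "minimalist"];

// Suite mode: every test case under every persona, one result file per run.
async function runSuite(): Promise<void> {
  for (const id of await listTestCases()) {
    for (const persona of PERSONAS) {
      await runBenchmark(id, persona);
    }
  }
}
```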

## Tips for Valid Benchmarks

1. **Same evaluator**: The same person should score the qualitative metrics
2. **Blind evaluation**: Score the output before checking which persona produced it
3. **Multiple runs**: Run each test 3+ times for reliability
4. **Fresh context**: Start a new session for each benchmark run
5. **Control variables**: Same time of day, same evaluator state