@pennyfarthing/benchmark 10.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/commands/benchmark-control.md +69 -0
- package/commands/benchmark.md +485 -0
- package/commands/job-fair.md +102 -0
- package/commands/solo.md +447 -0
- package/dist/benchmark-integration.d.ts +182 -0
- package/dist/benchmark-integration.d.ts.map +1 -0
- package/dist/benchmark-integration.js +710 -0
- package/dist/benchmark-integration.js.map +1 -0
- package/dist/benchmark-integration.test.d.ts +6 -0
- package/dist/benchmark-integration.test.d.ts.map +1 -0
- package/dist/benchmark-integration.test.js +41 -0
- package/dist/benchmark-integration.test.js.map +1 -0
- package/dist/index.d.ts +3 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +5 -0
- package/dist/index.js.map +1 -0
- package/dist/job-fair-aggregator.d.ts +150 -0
- package/dist/job-fair-aggregator.d.ts.map +1 -0
- package/dist/job-fair-aggregator.js +547 -0
- package/dist/job-fair-aggregator.js.map +1 -0
- package/dist/job-fair-aggregator.test.d.ts +6 -0
- package/dist/job-fair-aggregator.test.d.ts.map +1 -0
- package/dist/job-fair-aggregator.test.js +35 -0
- package/dist/job-fair-aggregator.test.js.map +1 -0
- package/dist/package-exports.test.d.ts +13 -0
- package/dist/package-exports.test.d.ts.map +1 -0
- package/dist/package-exports.test.js +192 -0
- package/dist/package-exports.test.js.map +1 -0
- package/docs/BENCHMARK-METHODOLOGY.md +105 -0
- package/docs/BENCHMARKING.md +311 -0
- package/docs/OCEAN-BENCHMARKING.md +210 -0
- package/docs/benchmarks-guide.md +62 -0
- package/package.json +66 -0
- package/scenarios/README.md +145 -0
- package/scenarios/architecture/database-selection.yaml +119 -0
- package/scenarios/architecture/legacy-modernization.yaml +153 -0
- package/scenarios/architecture/scaling-decision.yaml +88 -0
- package/scenarios/code-review/graphql-api-review.yaml +714 -0
- package/scenarios/code-review/order-service.yaml +622 -0
- package/scenarios/code-review/react-auth-component.yaml +569 -0
- package/scenarios/code-review/security-review.yaml +145 -0
- package/scenarios/code-review/terraform-infrastructure.yaml +582 -0
- package/scenarios/debug/buggy-user-service.yaml +541 -0
- package/scenarios/debug/null-pointer.yaml +130 -0
- package/scenarios/debugging/async-control-flow.yaml +161 -0
- package/scenarios/debugging/auth-bypass.yaml +197 -0
- package/scenarios/debugging/error-handling.yaml +178 -0
- package/scenarios/debugging/input-validation.yaml +157 -0
- package/scenarios/debugging/null-check-missing.yaml +139 -0
- package/scenarios/debugging/off-by-one-loop.yaml +132 -0
- package/scenarios/debugging/race-condition.yaml +180 -0
- package/scenarios/debugging/resource-leak.yaml +166 -0
- package/scenarios/debugging/simple-logic-error.yaml +115 -0
- package/scenarios/debugging/sql-injection.yaml +163 -0
- package/scenarios/dev/event-processor-tdd.yaml +764 -0
- package/scenarios/dev/migration-disaster.yaml +415 -0
- package/scenarios/dev/race-condition-cache.yaml +546 -0
- package/scenarios/dev/tdd-shopping-cart.yaml +681 -0
- package/scenarios/schema.yaml +639 -0
- package/scenarios/sm/dependency-deadlock.yaml +414 -0
- package/scenarios/sm/executive-pet-project.yaml +336 -0
- package/scenarios/sm/layoff-planning.yaml +356 -0
- package/scenarios/sm/sprint-planning-conflict.yaml +303 -0
- package/scenarios/sm/story-breakdown.yaml +240 -0
- package/scenarios/sm/three-sprint-failure.yaml +397 -0
- package/scenarios/swe-bench/README.md +57 -0
- package/scenarios/swe-bench/astropy-12907.yaml +128 -0
- package/scenarios/swe-bench/astropy-13398.yaml +177 -0
- package/scenarios/swe-bench/astropy-14309.yaml +180 -0
- package/scenarios/swe-bench/django-10097.yaml +106 -0
- package/scenarios/swe-bench/django-10554.yaml +140 -0
- package/scenarios/swe-bench/django-10973.yaml +93 -0
- package/scenarios/swe-bench/flask-5014-reviewer.yaml +145 -0
- package/scenarios/swe-bench/flask-5014-tea.yaml +123 -0
- package/scenarios/swe-bench/flask-5014.yaml +91 -0
- package/scenarios/swe-bench/import-swebench.py +246 -0
- package/scenarios/swe-bench/matplotlib-13989.yaml +139 -0
- package/scenarios/swe-bench/matplotlib-14623.yaml +127 -0
- package/scenarios/swe-bench/requests-1142-reviewer.yaml +144 -0
- package/scenarios/swe-bench/requests-1142-tea.yaml +135 -0
- package/scenarios/swe-bench/requests-1142.yaml +100 -0
- package/scenarios/swe-bench/requests-2931.yaml +98 -0
- package/scenarios/swe-bench/seaborn-3069.yaml +102 -0
- package/scenarios/swe-bench/sphinx-7590.yaml +108 -0
- package/scenarios/swe-bench/xarray-3993.yaml +104 -0
- package/scenarios/swe-bench/xarray-6992.yaml +136 -0
- package/scenarios/tea/checkout-component-tests.yaml +596 -0
- package/scenarios/tea/cli-tool-tests.yaml +561 -0
- package/scenarios/tea/microservice-integration-tests.yaml +520 -0
- package/scenarios/tea/payment-processor-tests.yaml +550 -0
- package/scripts/aggregate-benchmark-stats.js +315 -0
- package/scripts/aggregate-benchmark-stats.sh +8 -0
- package/scripts/benchmark-runner.js +392 -0
- package/scripts/benchmark-runner.sh +8 -0
- package/scripts/consolidate-job-fair.sh +107 -0
- package/scripts/convert-jobfair-to-benchmarks.sh +230 -0
- package/scripts/job-fair-batch.sh +116 -0
- package/scripts/job-fair-progress.sh +35 -0
- package/scripts/job-fair-runner.sh +278 -0
- package/scripts/job-fair-status.sh +80 -0
- package/scripts/job-fair-watcher-v2.sh +38 -0
- package/scripts/job-fair-watcher.sh +50 -0
- package/scripts/parallel-benchmark.sh +140 -0
- package/scripts/solo-runner.sh +344 -0
- package/scripts/test/ensure-swebench-data.sh +59 -0
- package/scripts/test/ground-truth-judge.py +220 -0
- package/scripts/test/swebench-judge.py +374 -0
- package/scripts/test/test-cache.sh +165 -0
- package/scripts/test/test-setup.sh +337 -0
- package/scripts/theme/compute-theme-tiers.sh +13 -0
- package/scripts/theme/compute_theme_tiers.py +402 -0
- package/scripts/theme/update-theme-tiers.sh +97 -0
- package/skills/finalize-run/SKILL.md +261 -0
- package/skills/judge/SKILL.md +644 -0
- package/skills/persona-benchmark/SKILL.md +187 -0
@@ -0,0 +1,261 @@
---
name: finalize-run
description: Validate and save benchmark run results. Use when completing a benchmark run, validating results before storage, or ensuring all runs pass through the single guardrail exit point.
---

# Finalize Run Skill

<run>Validates and saves benchmark run results</run>
<output>JSON with validation success status and saved file path</output>

All runs MUST pass through this skill before saving. This is the guardrail.

## Invocation

```
/finalize-run --type <type> --data <json>
```

**Types:**
- `solo` - Single agent evaluation
- `duel` - Two-agent comparison
- `relay` - Team relay competition

## Validation Rules

### Agent Validation

For EACH agent in the run:

| Field | Rule | Action on Fail |
|-------|------|----------------|
| `cli_timestamp` | Valid ISO8601 | REJECT |
| `response_text` | ≥ 200 characters | REJECT |
| `input_tokens` | > 0 | REJECT |
| `output_tokens` | > 0 | REJECT |

### Judge Validation

| Field | Rule | Action on Fail |
|-------|------|----------------|
| `cli_timestamp` | Valid ISO8601 | REJECT |
| `response_text` | Contains "WEIGHTED_TOTAL" or "RATING:" | REJECT |
| `response_text` | ≥ 100 characters | REJECT |

### Score Validation

| Field | Rule | Action on Fail |
|-------|------|----------------|
| `total` | Number 1-100 | REJECT |
| Extracted from judge | Matches claimed score | REJECT |

### Timestamp Sanity

```
elapsed = last_timestamp - first_timestamp   # seconds
minimum = 30 × number_of_agents              # seconds

if elapsed < minimum:
    WARN: "Timestamps suspiciously close"
```

## On Invoke

### Step 1: Parse Input

Extract:
- `type`: solo, duel, or relay
- `data`: JSON with run data

**Required data structure:**

```json
{
  "type": "solo|duel|relay",
  "timestamp": "ISO8601",
  "scenario": {"name": "...", "title": "..."},
  "agents": [
    {
      "spec": "theme:agent",
      "cli_timestamp": "ISO8601",
      "response_text": "full response",
      "input_tokens": 1234,
      "output_tokens": 5678
    }
  ],
  "judge": {
    "cli_timestamp": "ISO8601",
    "response_text": "full verdict",
    "input_tokens": 2345,
    "output_tokens": 890
  },
  "scores": {"spec": score},
  "output_path": "results/..."
}
```

### Step 2: Validate Agents

For each agent:

```bash
# Check timestamp format (date prefix only; this does not fully validate ISO 8601)
if ! [[ "$cli_timestamp" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}T ]]; then
  REJECT "Invalid timestamp: $cli_timestamp"
fi

# Check response length
response_len=${#response_text}
if [[ $response_len -lt 200 ]]; then
  REJECT "Response too short: $response_len chars (min 200)"
fi

# Check tokens (both counts must be positive)
if [[ $input_tokens -le 0 ]] || [[ $output_tokens -le 0 ]]; then
  REJECT "Invalid tokens: in=$input_tokens out=$output_tokens"
fi
```
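
The validation snippets call `REJECT` and `WARN`, but the skill does not define them. A minimal sketch follows, assuming `REJECT` prints the reason and stops with a non-zero status while `WARN` only logs and continues. Both behaviors are inferred from the "Action on Fail" columns, and `check_agent` is a hypothetical wrapper, not part of the package. The sketch sticks to POSIX shell syntax so it also runs outside bash.

```bash
# Hypothetical helpers; the skill itself does not define them.
REJECT() {
  printf 'REJECT: %s\n' "$1" >&2
  exit 1   # assumption: a rejected run stops here so nothing is saved
}

WARN() {
  printf 'WARN: %s\n' "$1" >&2   # assumption: warnings do not stop the run
}

# Hypothetical wrapper applying the agent checks to one agent.
# Usage: check_agent <cli_timestamp> <response_text> <input_tokens> <output_tokens>
check_agent() {
  cli_timestamp=$1 response_text=$2 input_tokens=$3 output_tokens=$4

  # Same date-prefix check as the agent validation step
  if ! printf '%s\n' "$cli_timestamp" |
      grep -qE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T'; then
    REJECT "Invalid timestamp: $cli_timestamp"
  fi

  if [ "${#response_text}" -lt 200 ]; then
    REJECT "Response too short: ${#response_text} chars (min 200)"
  fi

  if [ "$input_tokens" -le 0 ] || [ "$output_tokens" -le 0 ]; then
    REJECT "Invalid tokens: in=$input_tokens out=$output_tokens"
  fi
}
```

Because `REJECT` calls `exit`, running `check_agent` inside a subshell, for example `( check_agent "$ts" "$text" "$in" "$out" )`, lets the caller inspect the status without terminating itself.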

### Step 3: Validate Judge

```bash
# Check timestamp format (same prefix check as for agents)
if ! [[ "$judge_timestamp" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}T ]]; then
  REJECT "Invalid judge timestamp: $judge_timestamp"
fi

# Check for score marker
if ! echo "$judge_response" | grep -qE "WEIGHTED_TOTAL|RATING:"; then
  REJECT "Judge response missing score marker"
fi

# Check response length
if [[ ${#judge_response} -lt 100 ]]; then
  REJECT "Judge response too short"
fi

# Check tokens
if [[ $judge_input_tokens -le 0 ]] || [[ $judge_output_tokens -le 0 ]]; then
  REJECT "Invalid judge tokens"
fi
```

### Step 4: Validate Scores

```bash
# Extract score from judge (last integer that follows WEIGHTED_TOTAL)
extracted=$(echo "$judge_response" | grep -oE "WEIGHTED_TOTAL[^0-9]*([0-9]+)" | grep -oE "[0-9]+" | tail -1)

# Verify against claimed score
if [[ "$extracted" != "$claimed_score" ]]; then
  REJECT "Score mismatch: claimed=$claimed_score extracted=$extracted"
fi

# Check range
if [[ $extracted -lt 1 ]] || [[ $extracted -gt 100 ]]; then
  REJECT "Score out of range: $extracted"
fi
```
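
To make the extraction concrete, here is a small worked example that wraps the same `grep` pipeline in a function. The name `extract_score` and the sample verdict are made up for illustration.

```bash
# Hypothetical helper: print the last integer that follows WEIGHTED_TOTAL.
extract_score() {
  printf '%s\n' "$1" |
    grep -oE 'WEIGHTED_TOTAL[^0-9]*[0-9]+' |
    grep -oE '[0-9]+' |
    tail -n 1
}

# Made-up judge verdict, for illustration only.
verdict='Correctness: 9/10
Clarity: 8/10
WEIGHTED_TOTAL: 87/100'

extract_score "$verdict"   # prints 87
```

Two properties of this pipeline are worth knowing. A decimal total such as `87.5` is read as `87`, and a verdict that carries only `RATING:` yields an empty string, which the mismatch and range checks then reject.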

### Step 5: Timestamp Sanity Check

```bash
# Calculate elapsed time in seconds (date -d requires GNU date)
first_ts=$(date -d "$first_agent_timestamp" +%s)
last_ts=$(date -d "$judge_timestamp" +%s)
elapsed=$((last_ts - first_ts))

# Check minimum expected time (30 seconds per agent)
num_agents=${#agents[@]}
minimum=$((30 * num_agents))

if [[ $elapsed -lt $minimum ]]; then
  WARN "Elapsed time ${elapsed}s < expected ${minimum}s"
fi
```
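
`date -d` is a GNU extension and fails on BSD and macOS. The sketch below shows one possible fallback and works through the threshold for a two-agent run. The helper `to_epoch`, the assumption that timestamps are UTC with a trailing `Z`, and the sample timestamps are all illustrative.

```bash
# Hypothetical helper: ISO 8601 UTC timestamp (ending in Z) to epoch seconds.
# Tries the GNU form first, then the BSD/macOS form.
to_epoch() {
  date -u -d "$1" +%s 2>/dev/null ||
    date -u -j -f '%Y-%m-%dT%H:%M:%SZ' "$1" +%s
}

# Made-up timestamps for a duel (two agents).
first_ts=$(to_epoch '2025-01-15T10:30:00Z')
last_ts=$(to_epoch '2025-01-15T10:30:45Z')
elapsed=$((last_ts - first_ts))
minimum=$((30 * 2))

if [ "$elapsed" -lt "$minimum" ]; then
  echo "WARN: Elapsed time ${elapsed}s < expected ${minimum}s"
fi
```

With GNU or BSD `date` available, this prints `WARN: Elapsed time 45s < expected 60s`, since 45 seconds is below the 60-second minimum for two agents.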

### Step 6: Display Validation Report

```markdown
### Finalize Run Validation

**Type:** {type}
**Scenario:** {scenario.name}

#### Agent Validation
| Agent | Timestamp | Response | Tokens | Status |
|-------|-----------|----------|--------|--------|
| {spec} | {ts} | {len} chars | in={in} out={out} | ✓ |

#### Judge Validation
| Field | Value | Status |
|-------|-------|--------|
| Timestamp | {ts} | ✓ |
| Response | {len} chars | ✓ |
| Score | {score}/100 | ✓ |

#### Timestamp Sanity
- Elapsed: {elapsed}s
- Expected: ≥{minimum}s
- Status: ✓ PASS

---
**VALIDATION: PASSED**
```

### Step 7: Save Results

If ALL validations pass:

```bash
# Ensure directory exists
mkdir -p "$(dirname "$output_path")"

# Write result
echo "$result_json" > "$output_path"
```

Display:
```
✓ Saved to {output_path}
```
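
If an interrupted write must never leave a half-written result file behind, one option is to write to a temporary file in the target directory and rename it into place. This is an optional hardening sketch, not something the skill prescribes, and `save_result` is a hypothetical name.

```bash
# Hypothetical helper: write to a temp file next to the target, then rename.
# Usage: save_result <result_json> <output_path>
save_result() {
  result_json=$1 output_path=$2
  dir=$(dirname "$output_path")
  mkdir -p "$dir" || return 1

  # Same directory as the target, so the final mv is a rename on one filesystem
  tmp=$(mktemp "$dir/.finalize-run.XXXXXX") || return 1

  if printf '%s\n' "$result_json" > "$tmp" && mv "$tmp" "$output_path"; then
    echo "✓ Saved to $output_path"
  else
    rm -f "$tmp"   # clean up the partial file on failure
    return 1
  fi
}
```

Note that `mktemp` creates the file with owner-only permissions, so a `chmod` may be needed if other users have to read the results.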

### Step 8: Return Success

```json
{
  "success": true,
  "path": "{output_path}",
  "validation": {
    "agents_validated": {count},
    "judge_validated": true,
    "scores_verified": true,
    "timestamp_sane": true
  }
}
```

## On Validation Failure

```markdown
### ❌ VALIDATION FAILED

**Failed Check:** {which validation}
**Reason:** {specific reason}
**Value:** {what was provided}

**This run will NOT be saved.**

To fix:
- {remediation steps}
```

Return:
```json
{
  "success": false,
  "error": "{reason}",
  "failed_check": "{which}"
}
```
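
Failure reasons often contain quotes, which would break the JSON if pasted in verbatim. A minimal sketch of emitting the failure object from shell follows. `json_escape` and `fail_json` are hypothetical helpers, the `failed_check` value is made up, and only backslashes and double quotes are escaped.

```bash
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
# Control characters such as newlines are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Usage: fail_json <failed_check> <reason>
fail_json() {
  printf '{"success": false, "error": "%s", "failed_check": "%s"}\n' \
    "$(json_escape "$2")" "$(json_escape "$1")"
}

# Made-up check name and reason, for illustration only.
fail_json 'agent.response_text' 'Response too short: 42 chars (min 200)'
```

Where a JSON tool is available, building the object with that tool is the more robust choice, since it also handles control characters.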

## The Golden Rule

**Real data or no data.**

Never estimate. Never fabricate. If validation fails, the run did not happen.