@pennyfarthing/benchmark 10.2.0

Files changed (115)
  1. package/commands/benchmark-control.md +69 -0
  2. package/commands/benchmark.md +485 -0
  3. package/commands/job-fair.md +102 -0
  4. package/commands/solo.md +447 -0
  5. package/dist/benchmark-integration.d.ts +182 -0
  6. package/dist/benchmark-integration.d.ts.map +1 -0
  7. package/dist/benchmark-integration.js +710 -0
  8. package/dist/benchmark-integration.js.map +1 -0
  9. package/dist/benchmark-integration.test.d.ts +6 -0
  10. package/dist/benchmark-integration.test.d.ts.map +1 -0
  11. package/dist/benchmark-integration.test.js +41 -0
  12. package/dist/benchmark-integration.test.js.map +1 -0
  13. package/dist/index.d.ts +3 -0
  14. package/dist/index.d.ts.map +1 -0
  15. package/dist/index.js +5 -0
  16. package/dist/index.js.map +1 -0
  17. package/dist/job-fair-aggregator.d.ts +150 -0
  18. package/dist/job-fair-aggregator.d.ts.map +1 -0
  19. package/dist/job-fair-aggregator.js +547 -0
  20. package/dist/job-fair-aggregator.js.map +1 -0
  21. package/dist/job-fair-aggregator.test.d.ts +6 -0
  22. package/dist/job-fair-aggregator.test.d.ts.map +1 -0
  23. package/dist/job-fair-aggregator.test.js +35 -0
  24. package/dist/job-fair-aggregator.test.js.map +1 -0
  25. package/dist/package-exports.test.d.ts +13 -0
  26. package/dist/package-exports.test.d.ts.map +1 -0
  27. package/dist/package-exports.test.js +192 -0
  28. package/dist/package-exports.test.js.map +1 -0
  29. package/docs/BENCHMARK-METHODOLOGY.md +105 -0
  30. package/docs/BENCHMARKING.md +311 -0
  31. package/docs/OCEAN-BENCHMARKING.md +210 -0
  32. package/docs/benchmarks-guide.md +62 -0
  33. package/package.json +66 -0
  34. package/scenarios/README.md +145 -0
  35. package/scenarios/architecture/database-selection.yaml +119 -0
  36. package/scenarios/architecture/legacy-modernization.yaml +153 -0
  37. package/scenarios/architecture/scaling-decision.yaml +88 -0
  38. package/scenarios/code-review/graphql-api-review.yaml +714 -0
  39. package/scenarios/code-review/order-service.yaml +622 -0
  40. package/scenarios/code-review/react-auth-component.yaml +569 -0
  41. package/scenarios/code-review/security-review.yaml +145 -0
  42. package/scenarios/code-review/terraform-infrastructure.yaml +582 -0
  43. package/scenarios/debug/buggy-user-service.yaml +541 -0
  44. package/scenarios/debug/null-pointer.yaml +130 -0
  45. package/scenarios/debugging/async-control-flow.yaml +161 -0
  46. package/scenarios/debugging/auth-bypass.yaml +197 -0
  47. package/scenarios/debugging/error-handling.yaml +178 -0
  48. package/scenarios/debugging/input-validation.yaml +157 -0
  49. package/scenarios/debugging/null-check-missing.yaml +139 -0
  50. package/scenarios/debugging/off-by-one-loop.yaml +132 -0
  51. package/scenarios/debugging/race-condition.yaml +180 -0
  52. package/scenarios/debugging/resource-leak.yaml +166 -0
  53. package/scenarios/debugging/simple-logic-error.yaml +115 -0
  54. package/scenarios/debugging/sql-injection.yaml +163 -0
  55. package/scenarios/dev/event-processor-tdd.yaml +764 -0
  56. package/scenarios/dev/migration-disaster.yaml +415 -0
  57. package/scenarios/dev/race-condition-cache.yaml +546 -0
  58. package/scenarios/dev/tdd-shopping-cart.yaml +681 -0
  59. package/scenarios/schema.yaml +639 -0
  60. package/scenarios/sm/dependency-deadlock.yaml +414 -0
  61. package/scenarios/sm/executive-pet-project.yaml +336 -0
  62. package/scenarios/sm/layoff-planning.yaml +356 -0
  63. package/scenarios/sm/sprint-planning-conflict.yaml +303 -0
  64. package/scenarios/sm/story-breakdown.yaml +240 -0
  65. package/scenarios/sm/three-sprint-failure.yaml +397 -0
  66. package/scenarios/swe-bench/README.md +57 -0
  67. package/scenarios/swe-bench/astropy-12907.yaml +128 -0
  68. package/scenarios/swe-bench/astropy-13398.yaml +177 -0
  69. package/scenarios/swe-bench/astropy-14309.yaml +180 -0
  70. package/scenarios/swe-bench/django-10097.yaml +106 -0
  71. package/scenarios/swe-bench/django-10554.yaml +140 -0
  72. package/scenarios/swe-bench/django-10973.yaml +93 -0
  73. package/scenarios/swe-bench/flask-5014-reviewer.yaml +145 -0
  74. package/scenarios/swe-bench/flask-5014-tea.yaml +123 -0
  75. package/scenarios/swe-bench/flask-5014.yaml +91 -0
  76. package/scenarios/swe-bench/import-swebench.py +246 -0
  77. package/scenarios/swe-bench/matplotlib-13989.yaml +139 -0
  78. package/scenarios/swe-bench/matplotlib-14623.yaml +127 -0
  79. package/scenarios/swe-bench/requests-1142-reviewer.yaml +144 -0
  80. package/scenarios/swe-bench/requests-1142-tea.yaml +135 -0
  81. package/scenarios/swe-bench/requests-1142.yaml +100 -0
  82. package/scenarios/swe-bench/requests-2931.yaml +98 -0
  83. package/scenarios/swe-bench/seaborn-3069.yaml +102 -0
  84. package/scenarios/swe-bench/sphinx-7590.yaml +108 -0
  85. package/scenarios/swe-bench/xarray-3993.yaml +104 -0
  86. package/scenarios/swe-bench/xarray-6992.yaml +136 -0
  87. package/scenarios/tea/checkout-component-tests.yaml +596 -0
  88. package/scenarios/tea/cli-tool-tests.yaml +561 -0
  89. package/scenarios/tea/microservice-integration-tests.yaml +520 -0
  90. package/scenarios/tea/payment-processor-tests.yaml +550 -0
  91. package/scripts/aggregate-benchmark-stats.js +315 -0
  92. package/scripts/aggregate-benchmark-stats.sh +8 -0
  93. package/scripts/benchmark-runner.js +392 -0
  94. package/scripts/benchmark-runner.sh +8 -0
  95. package/scripts/consolidate-job-fair.sh +107 -0
  96. package/scripts/convert-jobfair-to-benchmarks.sh +230 -0
  97. package/scripts/job-fair-batch.sh +116 -0
  98. package/scripts/job-fair-progress.sh +35 -0
  99. package/scripts/job-fair-runner.sh +278 -0
  100. package/scripts/job-fair-status.sh +80 -0
  101. package/scripts/job-fair-watcher-v2.sh +38 -0
  102. package/scripts/job-fair-watcher.sh +50 -0
  103. package/scripts/parallel-benchmark.sh +140 -0
  104. package/scripts/solo-runner.sh +344 -0
  105. package/scripts/test/ensure-swebench-data.sh +59 -0
  106. package/scripts/test/ground-truth-judge.py +220 -0
  107. package/scripts/test/swebench-judge.py +374 -0
  108. package/scripts/test/test-cache.sh +165 -0
  109. package/scripts/test/test-setup.sh +337 -0
  110. package/scripts/theme/compute-theme-tiers.sh +13 -0
  111. package/scripts/theme/compute_theme_tiers.py +402 -0
  112. package/scripts/theme/update-theme-tiers.sh +97 -0
  113. package/skills/finalize-run/SKILL.md +261 -0
  114. package/skills/judge/SKILL.md +644 -0
  115. package/skills/persona-benchmark/SKILL.md +187 -0
@@ -0,0 +1,261 @@
+ ---
+ name: finalize-run
+ description: Validate and save benchmark run results. Use when completing a benchmark run, validating results before storage, or ensuring all runs pass through the single guardrail exit point.
+ ---
+
+ # Finalize Run Skill
+
+ <run>Validates and saves benchmark run results</run>
+ <output>JSON with validation success status and saved file path</output>
+
+ All runs MUST pass through this skill before saving. This is the guardrail.
+
+ ## Invocation
+
+ ```
+ /finalize-run --type <type> --data <json>
+ ```
+
+ **Types:**
+ - `solo` - Single agent evaluation
+ - `duel` - Two-agent comparison
+ - `relay` - Team relay competition
+
+ ## Validation Rules
+
+ ### Agent Validation
+
+ For EACH agent in the run:
+
+ | Field | Rule | Action on Fail |
+ |-------|------|----------------|
+ | `cli_timestamp` | Valid ISO8601 | REJECT |
+ | `response_text` | ≥ 200 characters | REJECT |
+ | `input_tokens` | > 0 | REJECT |
+ | `output_tokens` | > 0 | REJECT |
+
+ ### Judge Validation
+
+ | Field | Rule | Action on Fail |
+ |-------|------|----------------|
+ | `cli_timestamp` | Valid ISO8601 | REJECT |
+ | `response_text` | Contains "WEIGHTED_TOTAL" or "RATING:" | REJECT |
+ | `response_text` | ≥ 100 characters | REJECT |
+
+ ### Score Validation
+
+ | Field | Rule | Action on Fail |
+ |-------|------|----------------|
+ | `total` | Number from 1 to 100 | REJECT |
+ | Score extracted from judge response | Matches claimed score | REJECT |
+
+ ### Timestamp Sanity
+
+ ```
+ elapsed = last_timestamp - first_timestamp   # seconds
+ minimum = 30 × number_of_agents              # seconds
+
+ if elapsed < minimum:
+     WARN: "Timestamps suspiciously close"
+ ```
+
+ ## On Invoke
+
+ ### Step 1: Parse Input
+
+ Extract:
+ - `type`: solo, duel, or relay
+ - `data`: JSON with run data
+
+ **Required data structure:**
+
+ ```json
+ {
+   "type": "solo|duel|relay",
+   "timestamp": "ISO8601",
+   "scenario": {"name": "...", "title": "..."},
+   "agents": [
+     {
+       "spec": "theme:agent",
+       "cli_timestamp": "ISO8601",
+       "response_text": "full response",
+       "input_tokens": 1234,
+       "output_tokens": 5678
+     }
+   ],
+   "judge": {
+     "cli_timestamp": "ISO8601",
+     "response_text": "full verdict",
+     "input_tokens": 2345,
+     "output_tokens": 890
+   },
+   "scores": {"spec": score},
+   "output_path": "results/..."
+ }
+ ```
+
+ ### Step 2: Validate Agents
+
+ For each agent:
+
+ ```bash
+ # Check timestamp format (prefix check only: YYYY-MM-DDT...)
+ if ! [[ "$cli_timestamp" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}T ]]; then
+   REJECT "Invalid timestamp: $cli_timestamp"
+ fi
+
+ # Check response length (minimum 200 characters)
+ response_len=${#response_text}
+ if [[ $response_len -lt 200 ]]; then
+   REJECT "Response too short: $response_len chars (min 200)"
+ fi
+
+ # Check tokens (both counts must be greater than zero)
+ if [[ $input_tokens -le 0 ]] || [[ $output_tokens -le 0 ]]; then
+   REJECT "Invalid tokens: in=$input_tokens out=$output_tokens"
+ fi
+ ```
+
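The snippet above calls `REJECT` without defining it. A minimal sketch of how the agent checks might be wrapped into one function, assuming `REJECT` only reports the reason on stderr and the caller signals failure through a non-zero return status. The names `validate_agent`, its argument order, and the `REJECT` body are illustrative assumptions, not part of the skill. The sketch sticks to POSIX `sh` constructs, so `case` stands in for the `[[ =~ ]]` prefix check.

```bash
# Hypothetical helper: the skill calls REJECT but does not define it.
# Here it only reports the reason on stderr; callers return non-zero themselves.
REJECT() { printf 'REJECT: %s\n' "$1" >&2; }

# validate_agent <cli_timestamp> <response_text> <input_tokens> <output_tokens>
# Returns 0 when every agent check passes, 1 on the first failure.
validate_agent() {
  ts=$1; text=$2; in_tok=$3; out_tok=$4

  # Same prefix check as the regex above: YYYY-MM-DDT...
  case $ts in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T*) ;;
    *) REJECT "Invalid timestamp: $ts"; return 1 ;;
  esac

  # Response must be at least 200 characters.
  if [ "${#text}" -lt 200 ]; then
    REJECT "Response too short: ${#text} chars (min 200)"
    return 1
  fi

  # Token counts must be positive integers.
  for n in "$in_tok" "$out_tok"; do
    case $n in
      ''|*[!0-9]*) REJECT "Invalid tokens: in=$in_tok out=$out_tok"; return 1 ;;
    esac
    if [ "$n" -le 0 ]; then
      REJECT "Invalid tokens: in=$in_tok out=$out_tok"
      return 1
    fi
  done
  return 0
}
```

A caller could then stop at the first bad agent with `validate_agent "$cli_timestamp" "$response_text" "$input_tokens" "$output_tokens" || exit 1`.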
+ ### Step 3: Validate Judge
+
+ ```bash
+ # Check for score marker
+ if ! echo "$judge_response" | grep -qE "WEIGHTED_TOTAL|RATING:"; then
+   REJECT "Judge response missing score marker"
+ fi
+
+ # Check response length
+ if [[ ${#judge_response} -lt 100 ]]; then
+   REJECT "Judge response too short"
+ fi
+
+ # Check tokens
+ if [[ $judge_input_tokens -le 0 ]] || [[ $judge_output_tokens -le 0 ]]; then
+   REJECT "Invalid judge tokens"
+ fi
+ ```
+
+ ### Step 4: Validate Scores
+
+ ```bash
+ # Extract score from judge (only the WEIGHTED_TOTAL marker is parsed; a RATING:-only verdict yields an empty value)
+ extracted=$(echo "$judge_response" | grep -oE "WEIGHTED_TOTAL[^0-9]*([0-9]+)" | grep -oE "[0-9]+" | tail -1)
+
+ # Verify against claimed score
+ if [[ "$extracted" != "$claimed_score" ]]; then
+   REJECT "Score mismatch: claimed=$claimed_score extracted=$extracted"
+ fi
+
+ # Check range (1 to 100 inclusive)
+ if [[ $extracted -lt 1 ]] || [[ $extracted -gt 100 ]]; then
+   REJECT "Score out of range: $extracted"
+ fi
+ ```
+
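To make the extraction pipeline above concrete, here is a small worked example. It wraps the same `grep -oE ... | tail -1` chain in a function and runs it on a made-up verdict. The function name `extract_score` and the sample verdict text are assumptions for illustration only.

```bash
# extract_score <judge_response>
# Prints the last integer that follows a WEIGHTED_TOTAL marker, or nothing.
extract_score() {
  printf '%s\n' "$1" \
    | grep -oE 'WEIGHTED_TOTAL[^0-9]*[0-9]+' \
    | grep -oE '[0-9]+' \
    | tail -1
}

# Hypothetical verdict text, for illustration only.
verdict='Correctness: 8/10
WEIGHTED_TOTAL: 87
Summary: solid answer.'

extracted=$(extract_score "$verdict")
echo "extracted=$extracted"   # prints: extracted=87
```

The pattern takes the first run of digits after the marker, so a verdict line such as `WEIGHTED_TOTAL (out of 100): 87` would yield `100`, not `87`. A verdict that carries only `RATING:` yields an empty string and fails the comparison with the claimed score.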
+ ### Step 5: Timestamp Sanity Check
+
+ ```bash
+ # Calculate elapsed time in seconds (date -d is GNU date syntax; BSD/macOS date differs)
+ first_ts=$(date -d "$first_agent_timestamp" +%s)
+ last_ts=$(date -d "$judge_timestamp" +%s)
+ elapsed=$((last_ts - first_ts))
+
+ # Check minimum expected time (30 seconds per agent)
+ num_agents=${#agents[@]}
+ minimum=$((30 * num_agents))
+
+ if [[ $elapsed -lt $minimum ]]; then
+   WARN "Elapsed time ${elapsed}s < expected ${minimum}s"
+ fi
+ ```
+
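The arithmetic behind this check can be tried without depending on any particular `date` implementation. The sketch below takes epoch seconds directly. The function name `check_elapsed` and the PASS/WARN output strings are assumptions for illustration, not the skill's own output format.

```bash
# check_elapsed <first_epoch> <last_epoch> <num_agents>
# Prints WARN when the run finished faster than 30 seconds per agent, else PASS.
check_elapsed() {
  elapsed=$(( $2 - $1 ))
  minimum=$(( 30 * $3 ))
  if [ "$elapsed" -lt "$minimum" ]; then
    echo "WARN: elapsed ${elapsed}s < expected ${minimum}s"
  else
    echo "PASS: elapsed ${elapsed}s >= expected ${minimum}s"
  fi
}

# Worked example with 3 agents, so the minimum is 3 * 30 = 90 seconds.
check_elapsed 1704067200 1704067260 3   # 60 s elapsed  -> WARN
check_elapsed 1704067200 1704067320 3   # 120 s elapsed -> PASS
```

Note that in the skill a short elapsed time only produces a warning. Unlike the other checks, it does not reject the run.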
+ ### Step 6: Display Validation Report
+
+ ```markdown
+ ### Finalize Run Validation
+
+ **Type:** {type}
+ **Scenario:** {scenario.name}
+
+ #### Agent Validation
+ | Agent | Timestamp | Response | Tokens | Status |
+ |-------|-----------|----------|--------|--------|
+ | {spec} | {ts} | {len} chars | in={in} out={out} | ✓ |
+
+ #### Judge Validation
+ | Field | Value | Status |
+ |-------|-------|--------|
+ | Timestamp | {ts} | ✓ |
+ | Response | {len} chars | ✓ |
+ | Score | {score}/100 | ✓ |
+
+ #### Timestamp Sanity
+ - Elapsed: {elapsed}s
+ - Expected: ≥{minimum}s
+ - Status: ✓ PASS
+
+ ---
+ **VALIDATION: PASSED**
+ ```
+
+ ### Step 7: Save Results
+
+ If ALL validations pass:
+
+ ```bash
+ # Ensure directory exists
+ mkdir -p "$(dirname "$output_path")"
+
+ # Write result (printf avoids echo's shell-dependent backslash handling)
+ printf '%s\n' "$result_json" > "$output_path"
+ ```
+
+ Display:
+ ```
+ ✓ Saved to {output_path}
+ ```
+
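A direct redirect can leave a partially written file behind if the write is interrupted. One possible hardening, sketched here as an assumption and not as the skill's actual behavior, is to write to a temporary file in the target directory and rename it into place. The function name `save_result` and the temporary file prefix are hypothetical.

```bash
# save_result <result_json> <output_path>
# Writes to a temporary file next to the target, then renames it into place.
save_result() {
  dir=$(dirname "$2")
  mkdir -p "$dir" || return 1
  tmp=$(mktemp "$dir/.finalize-run.XXXXXX") || return 1
  if printf '%s\n' "$1" > "$tmp" && mv "$tmp" "$2"; then
    echo "✓ Saved to $2"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Because the temporary file lives in the same directory as the target, the final `mv` is a rename on the same filesystem. One side effect to be aware of: `mktemp` creates the file readable only by its owner, so the saved result keeps those permissions unless they are changed afterwards.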
+ ### Step 8: Return Success
+
+ ```json
+ {
+   "success": true,
+   "path": "{output_path}",
+   "validation": {
+     "agents_validated": {count},
+     "judge_validated": true,
+     "scores_verified": true,
+     "timestamp_sane": true
+   }
+ }
+ ```
+
+ ## On Validation Failure
+
+ ```markdown
+ ### ❌ VALIDATION FAILED
+
+ **Failed Check:** {which validation}
+ **Reason:** {specific reason}
+ **Value:** {what was provided}
+
+ **This run will NOT be saved.**
+
+ To fix:
+ - {remediation steps}
+ ```
+
+ Return:
+ ```json
+ {
+   "success": false,
+   "error": "{reason}",
+   "failed_check": "{which}"
+ }
+ ```
+
+ ## The Golden Rule
+
+ **Real data or no data.**
+
+ Never estimate. Never fabricate. If validation fails, the run did not happen.