@pennyfarthing/benchmark 10.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (115)
  1. package/commands/benchmark-control.md +69 -0
  2. package/commands/benchmark.md +485 -0
  3. package/commands/job-fair.md +102 -0
  4. package/commands/solo.md +447 -0
  5. package/dist/benchmark-integration.d.ts +182 -0
  6. package/dist/benchmark-integration.d.ts.map +1 -0
  7. package/dist/benchmark-integration.js +710 -0
  8. package/dist/benchmark-integration.js.map +1 -0
  9. package/dist/benchmark-integration.test.d.ts +6 -0
  10. package/dist/benchmark-integration.test.d.ts.map +1 -0
  11. package/dist/benchmark-integration.test.js +41 -0
  12. package/dist/benchmark-integration.test.js.map +1 -0
  13. package/dist/index.d.ts +3 -0
  14. package/dist/index.d.ts.map +1 -0
  15. package/dist/index.js +5 -0
  16. package/dist/index.js.map +1 -0
  17. package/dist/job-fair-aggregator.d.ts +150 -0
  18. package/dist/job-fair-aggregator.d.ts.map +1 -0
  19. package/dist/job-fair-aggregator.js +547 -0
  20. package/dist/job-fair-aggregator.js.map +1 -0
  21. package/dist/job-fair-aggregator.test.d.ts +6 -0
  22. package/dist/job-fair-aggregator.test.d.ts.map +1 -0
  23. package/dist/job-fair-aggregator.test.js +35 -0
  24. package/dist/job-fair-aggregator.test.js.map +1 -0
  25. package/dist/package-exports.test.d.ts +13 -0
  26. package/dist/package-exports.test.d.ts.map +1 -0
  27. package/dist/package-exports.test.js +192 -0
  28. package/dist/package-exports.test.js.map +1 -0
  29. package/docs/BENCHMARK-METHODOLOGY.md +105 -0
  30. package/docs/BENCHMARKING.md +311 -0
  31. package/docs/OCEAN-BENCHMARKING.md +210 -0
  32. package/docs/benchmarks-guide.md +62 -0
  33. package/package.json +66 -0
  34. package/scenarios/README.md +145 -0
  35. package/scenarios/architecture/database-selection.yaml +119 -0
  36. package/scenarios/architecture/legacy-modernization.yaml +153 -0
  37. package/scenarios/architecture/scaling-decision.yaml +88 -0
  38. package/scenarios/code-review/graphql-api-review.yaml +714 -0
  39. package/scenarios/code-review/order-service.yaml +622 -0
  40. package/scenarios/code-review/react-auth-component.yaml +569 -0
  41. package/scenarios/code-review/security-review.yaml +145 -0
  42. package/scenarios/code-review/terraform-infrastructure.yaml +582 -0
  43. package/scenarios/debug/buggy-user-service.yaml +541 -0
  44. package/scenarios/debug/null-pointer.yaml +130 -0
  45. package/scenarios/debugging/async-control-flow.yaml +161 -0
  46. package/scenarios/debugging/auth-bypass.yaml +197 -0
  47. package/scenarios/debugging/error-handling.yaml +178 -0
  48. package/scenarios/debugging/input-validation.yaml +157 -0
  49. package/scenarios/debugging/null-check-missing.yaml +139 -0
  50. package/scenarios/debugging/off-by-one-loop.yaml +132 -0
  51. package/scenarios/debugging/race-condition.yaml +180 -0
  52. package/scenarios/debugging/resource-leak.yaml +166 -0
  53. package/scenarios/debugging/simple-logic-error.yaml +115 -0
  54. package/scenarios/debugging/sql-injection.yaml +163 -0
  55. package/scenarios/dev/event-processor-tdd.yaml +764 -0
  56. package/scenarios/dev/migration-disaster.yaml +415 -0
  57. package/scenarios/dev/race-condition-cache.yaml +546 -0
  58. package/scenarios/dev/tdd-shopping-cart.yaml +681 -0
  59. package/scenarios/schema.yaml +639 -0
  60. package/scenarios/sm/dependency-deadlock.yaml +414 -0
  61. package/scenarios/sm/executive-pet-project.yaml +336 -0
  62. package/scenarios/sm/layoff-planning.yaml +356 -0
  63. package/scenarios/sm/sprint-planning-conflict.yaml +303 -0
  64. package/scenarios/sm/story-breakdown.yaml +240 -0
  65. package/scenarios/sm/three-sprint-failure.yaml +397 -0
  66. package/scenarios/swe-bench/README.md +57 -0
  67. package/scenarios/swe-bench/astropy-12907.yaml +128 -0
  68. package/scenarios/swe-bench/astropy-13398.yaml +177 -0
  69. package/scenarios/swe-bench/astropy-14309.yaml +180 -0
  70. package/scenarios/swe-bench/django-10097.yaml +106 -0
  71. package/scenarios/swe-bench/django-10554.yaml +140 -0
  72. package/scenarios/swe-bench/django-10973.yaml +93 -0
  73. package/scenarios/swe-bench/flask-5014-reviewer.yaml +145 -0
  74. package/scenarios/swe-bench/flask-5014-tea.yaml +123 -0
  75. package/scenarios/swe-bench/flask-5014.yaml +91 -0
  76. package/scenarios/swe-bench/import-swebench.py +246 -0
  77. package/scenarios/swe-bench/matplotlib-13989.yaml +139 -0
  78. package/scenarios/swe-bench/matplotlib-14623.yaml +127 -0
  79. package/scenarios/swe-bench/requests-1142-reviewer.yaml +144 -0
  80. package/scenarios/swe-bench/requests-1142-tea.yaml +135 -0
  81. package/scenarios/swe-bench/requests-1142.yaml +100 -0
  82. package/scenarios/swe-bench/requests-2931.yaml +98 -0
  83. package/scenarios/swe-bench/seaborn-3069.yaml +102 -0
  84. package/scenarios/swe-bench/sphinx-7590.yaml +108 -0
  85. package/scenarios/swe-bench/xarray-3993.yaml +104 -0
  86. package/scenarios/swe-bench/xarray-6992.yaml +136 -0
  87. package/scenarios/tea/checkout-component-tests.yaml +596 -0
  88. package/scenarios/tea/cli-tool-tests.yaml +561 -0
  89. package/scenarios/tea/microservice-integration-tests.yaml +520 -0
  90. package/scenarios/tea/payment-processor-tests.yaml +550 -0
  91. package/scripts/aggregate-benchmark-stats.js +315 -0
  92. package/scripts/aggregate-benchmark-stats.sh +8 -0
  93. package/scripts/benchmark-runner.js +392 -0
  94. package/scripts/benchmark-runner.sh +8 -0
  95. package/scripts/consolidate-job-fair.sh +107 -0
  96. package/scripts/convert-jobfair-to-benchmarks.sh +230 -0
  97. package/scripts/job-fair-batch.sh +116 -0
  98. package/scripts/job-fair-progress.sh +35 -0
  99. package/scripts/job-fair-runner.sh +278 -0
  100. package/scripts/job-fair-status.sh +80 -0
  101. package/scripts/job-fair-watcher-v2.sh +38 -0
  102. package/scripts/job-fair-watcher.sh +50 -0
  103. package/scripts/parallel-benchmark.sh +140 -0
  104. package/scripts/solo-runner.sh +344 -0
  105. package/scripts/test/ensure-swebench-data.sh +59 -0
  106. package/scripts/test/ground-truth-judge.py +220 -0
  107. package/scripts/test/swebench-judge.py +374 -0
  108. package/scripts/test/test-cache.sh +165 -0
  109. package/scripts/test/test-setup.sh +337 -0
  110. package/scripts/theme/compute-theme-tiers.sh +13 -0
  111. package/scripts/theme/compute_theme_tiers.py +402 -0
  112. package/scripts/theme/update-theme-tiers.sh +97 -0
  113. package/skills/finalize-run/SKILL.md +261 -0
  114. package/skills/judge/SKILL.md +644 -0
  115. package/skills/persona-benchmark/SKILL.md +187 -0
@@ -0,0 +1,187 @@
+ ---
+ name: persona-benchmark
+ description: Run benchmarks to compare persona effectiveness across themes. Use when testing which personas perform best on code review, test writing, or architecture tasks, or when running comparative analysis across themes.
+ ---
+
+ # Persona Benchmark Skill
+
+ Run benchmarks to compare persona effectiveness.
+
+ <run>
+ /persona-benchmark <test-case-id> <persona>
+ /persona-benchmark <test-case-id> <persona> [--analyze] [--suite]
+ </run>
+
+ <output>
+ Benchmark results saved to `.claude/benchmarks/results/{timestamp}-{persona}-{test-case-id}.yaml` with quantitative and qualitative metrics, or an analysis summary when using `--analyze`.
+ </output>
+
+ ## Usage
+
+ ```
+ /persona-benchmark <test-case-id> <persona>
+ /persona-benchmark cr-001 discworld
+ /persona-benchmark tw-001 literary-classics
+ /persona-benchmark --suite      # Run all test cases with all personas
+ /persona-benchmark --analyze    # Analyze collected results
+ ```
+
+ ## Benchmark Execution Protocol
+
+ ### Step 1: Load Test Case
+
+ Read the test case from `.claude/benchmarks/test-cases/{category}/{id}.yaml`.
+
+ Extract:
+ - `instructions` - What to give the agent
+ - `code` - The code/problem to analyze
+ - `known_issues`, `known_edge_cases`, or `known_considerations` - the answer key; DO NOT reveal it to the agent
+
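A minimal sketch of that split, not part of the package, assuming PyYAML and the field names above (the exact test-case layout may differ):

```python
# Sketch: load a test case and separate what the agent may see
# from the held-out answer key. Field names follow the skill text;
# the helper itself is illustrative.
import yaml  # PyYAML

def load_test_case(path):
    with open(path) as f:
        case = yaml.safe_load(f)
    visible = {k: case[k] for k in ("instructions", "code") if k in case}
    answer_key = {k: case[k] for k in
                  ("known_issues", "known_edge_cases", "known_considerations")
                  if k in case}
    return visible, answer_key

visible, answer_key = load_test_case(
    ".claude/benchmarks/test-cases/code-review/cr-001.yaml")
```
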
+ ### Step 2: Configure Persona
+
+ Temporarily set the persona in `.claude/persona-config.yaml`:
+ ```yaml
+ theme: {persona}
+ ```
+
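Because the change is meant to be temporary, it is worth restoring the previous config once the run finishes. One way to do that, sketched below; only the config path and the `theme:` key come from the skill text, the rest is illustrative:

```python
# Sketch: swap the persona for the duration of one benchmark run,
# then restore whatever was configured before.
import contextlib
import pathlib

CONFIG = pathlib.Path(".claude/persona-config.yaml")

@contextlib.contextmanager
def persona(theme):
    original = CONFIG.read_text() if CONFIG.exists() else None
    CONFIG.write_text(f"theme: {theme}\n")
    try:
        yield
    finally:
        if original is None:
            CONFIG.unlink(missing_ok=True)
        else:
            CONFIG.write_text(original)

# with persona("discworld"):
#     ... run the benchmark task ...
```
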
+ ### Step 3: Execute Task
+
+ Invoke the appropriate agent:
+ - `code-review` → `/reviewer`
+ - `test-writing` → `/tea`
+ - `architecture` → `/architect`
+
+ Provide ONLY:
+ - The instructions
+ - The code/problem
+
+ Do NOT reveal the known issues list.
+
+ ### Step 4: Collect Results
+
+ After the agent completes, score the output:
+
+ **Quantitative Scoring:**
+ ```yaml
+ # For code review:
+ issues_found: [list of issues detected]
+ issues_matched: [map to known_issues ids]
+ detection_rate: issues_matched / total_known_issues
+ false_positives: issues not in known list
+
+ # For test writing:
+ edge_cases_found: [list of edge cases covered]
+ edge_cases_matched: [map to known_edge_cases ids]
+ coverage_rate: edge_cases_matched / total_known_edge_cases
+
+ # For architecture:
+ considerations_found: [list of considerations mentioned]
+ considerations_matched: [map to known_considerations ids]
+ completeness_rate: considerations_matched / total_known_considerations
+ ```
+
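The code-review arithmetic above reduces to a few lines. A sketch, not part of the package; the per-issue `weights` are an assumption (the result format in Step 5 records `weighted_score`, but the weights themselves are not defined here):

```python
# Sketch: quantitative scoring for one code-review run.
# `found` = ids the agent's findings were mapped to; `known` = the
# answer-key ids. The optional severity weights are illustrative.
def score_review(found, known, weights=None):
    matched = [i for i in found if i in known]
    result = {
        "issues_matched": matched,
        "detection_rate": round(len(matched) / len(known), 2),
        "false_positives": len(found) - len(matched),
    }
    if weights:
        result["weighted_score"] = sum(weights[i] for i in matched)
        result["max_weighted_score"] = sum(weights.values())
        result["weighted_rate"] = round(
            result["weighted_score"] / result["max_weighted_score"], 2)
    return result

# e.g. 8 matched out of 14 known issues gives detection_rate 0.57,
# matching the example result below.
```
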
+ **Qualitative Scoring (1-5):**
+ - `persona_consistency`: Did the agent stay in character?
+ - `explanation_quality`: How well did it explain its findings?
+ - `actionability`: How usable is the output?
+ - `engagement`: How enjoyable was the interaction?
+
+ ### Step 5: Save Results
+
+ Write to `.claude/benchmarks/results/{timestamp}-{persona}-{test-case-id}.yaml`:
+
+ ```yaml
+ benchmark:
+   test_case: cr-001
+   persona: discworld
+   agent: reviewer
+   character: Granny Weatherwax
+   timestamp: 2024-01-15T10:30:00Z
+
+ quantitative:
+   items_found: 8
+   items_expected: 14
+   items_matched:
+     - SQL_INJECTION_1
+     - SQL_INJECTION_2
+     - PLAINTEXT_PASSWORD
+     - PASSWORD_EXPOSURE_1
+     - PASSWORD_EXPOSURE_2
+     - NO_AUTH_CHECK
+     - ASYNC_DELETE_NO_TX
+     - ROWS_NOT_CLOSED
+   detection_rate: 0.57
+   false_positives: 1
+   weighted_score: 15.5
+   max_weighted_score: 22.5
+   weighted_rate: 0.69
+
+ qualitative:
+   persona_consistency: 5
+   explanation_quality: 4
+   actionability: 4
+   engagement: 5
+
+ notes: |
+   Found both SQL injections immediately with strong language.
+   Missed the error handling issues.
+   Very much in character - "I aten't reviewing code that's already dead."
+
+ raw_output: |
+   [Full agent output preserved here]
+ ```
+
+ ## Analysis Mode
+
+ When run with `--analyze`:
+
+ 1. Load all results from `.claude/benchmarks/results/`
+
+ 2. Aggregate by persona:
+    ```
+    | Persona           | Detection Rate | False Pos | Persona Score | Engagement |
+    |-------------------|----------------|-----------|---------------|------------|
+    | discworld         | 0.71           | 1.2       | 4.8           | 4.9        |
+    | star-trek         | 0.68           | 0.8       | 4.5           | 4.2        |
+    | literary-classics | 0.73           | 1.5       | 4.2           | 4.0        |
+    | minimalist        | 0.65           | 0.5       | N/A           | 3.2        |
+    ```
+
+ 3. Aggregate by test category:
+    ```
+    | Category     | Best Persona      | Avg Detection |
+    |--------------|-------------------|---------------|
+    | code-review  | literary-classics | 0.71          |
+    | test-writing | discworld         | 0.68          |
+    | architecture | star-trek         | 0.75          |
+    ```
+
+ 4. Statistical significance (see the sketch after this list):
+    - Calculate the standard deviation across runs
+    - Note whether differences between personas are significant
+
+ 5. Qualitative patterns:
+    - Which personas stay in character best?
+    - Which provide the most actionable output?
+    - User enjoyment patterns
+
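A sketch of the per-persona aggregation and spread calculation, not part of the package; the key paths (`benchmark.persona`, `quantitative.detection_rate`) follow the Step 5 example:

```python
# Sketch: aggregate saved results by persona and report the mean and
# standard deviation of detection_rate across runs.
import glob
import statistics
from collections import defaultdict

import yaml  # PyYAML

rates = defaultdict(list)
for path in glob.glob(".claude/benchmarks/results/*.yaml"):
    with open(path) as f:
        r = yaml.safe_load(f)
    rates[r["benchmark"]["persona"]].append(r["quantitative"]["detection_rate"])

for persona, xs in sorted(rates.items()):
    sd = statistics.stdev(xs) if len(xs) > 1 else 0.0
    print(f"{persona:<20} mean={statistics.mean(xs):.2f} sd={sd:.2f} n={len(xs)}")
```

With only a handful of runs per persona, overlapping mean ± sd ranges are best read as "no meaningful difference".
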
+ ## Running a Full Suite
+
+ ```bash
+ # This will take a while - runs each test case with each persona
+ /persona-benchmark --suite
+ ```
+
+ Executes:
+ - All test cases in `test-cases/`
+ - With each persona: discworld, star-trek, literary-classics, minimalist
+ - Saves individual results
+ - Produces a summary comparison
+
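The suite is just the cross product of those two lists. A sketch, with a hypothetical `run_benchmark` standing in for whatever drives one `/persona-benchmark` invocation:

```python
# Sketch: enumerate the test-case x persona matrix the suite covers.
# The directory walk and the runner stub are both illustrative.
import glob
import os

PERSONAS = ["discworld", "star-trek", "literary-classics", "minimalist"]

def run_benchmark(case_id, theme):
    # Hypothetical stand-in: print the command one suite step would run.
    print(f"/persona-benchmark {case_id} {theme}")

for case_path in sorted(glob.glob(".claude/benchmarks/test-cases/**/*.yaml",
                                  recursive=True)):
    case_id = os.path.splitext(os.path.basename(case_path))[0]
    for theme in PERSONAS:
        run_benchmark(case_id, theme)
```
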
+ ## Tips for Valid Benchmarks
+
+ 1. **Same evaluator**: The same person should score the qualitative metrics throughout
+ 2. **Blind evaluation**: Score the output before checking which persona produced it
+ 3. **Multiple runs**: Run each test 3+ times for reliability
+ 4. **Fresh context**: Start a new session for each benchmark run
+ 5. **Control variables**: Same time of day, same evaluator state