@pennyfarthing/core 7.4.1 → 7.6.0

Files changed (127)
  1. package/package.json +1 -1
  2. package/packages/core/dist/cli/commands/doctor-legacy.test.d.ts +13 -0
  3. package/packages/core/dist/cli/commands/doctor-legacy.test.d.ts.map +1 -0
  4. package/packages/core/dist/cli/commands/doctor-legacy.test.js +207 -0
  5. package/packages/core/dist/cli/commands/doctor-legacy.test.js.map +1 -0
  6. package/packages/core/dist/cli/commands/doctor.d.ts +16 -0
  7. package/packages/core/dist/cli/commands/doctor.d.ts.map +1 -1
  8. package/packages/core/dist/cli/commands/doctor.js +130 -2
  9. package/packages/core/dist/cli/commands/doctor.js.map +1 -1
  10. package/packages/core/dist/cli/commands/init.d.ts.map +1 -1
  11. package/packages/core/dist/cli/commands/init.js +17 -27
  12. package/packages/core/dist/cli/commands/init.js.map +1 -1
  13. package/packages/core/dist/cli/commands/update.d.ts.map +1 -1
  14. package/packages/core/dist/cli/commands/update.js +21 -52
  15. package/packages/core/dist/cli/commands/update.js.map +1 -1
  16. package/packages/core/dist/cli/utils/symlinks.d.ts +15 -0
  17. package/packages/core/dist/cli/utils/symlinks.d.ts.map +1 -1
  18. package/packages/core/dist/cli/utils/symlinks.js +148 -2
  19. package/packages/core/dist/cli/utils/symlinks.js.map +1 -1
  20. package/packages/core/dist/cli/utils/themes.d.ts.map +1 -1
  21. package/packages/core/dist/cli/utils/themes.js +9 -0
  22. package/packages/core/dist/cli/utils/themes.js.map +1 -1
  23. package/pennyfarthing-dist/agents/dev.md +29 -24
  24. package/pennyfarthing-dist/agents/handoff.md +42 -119
  25. package/pennyfarthing-dist/agents/reviewer.md +32 -37
  26. package/pennyfarthing-dist/agents/sm-handoff.md +43 -66
  27. package/pennyfarthing-dist/agents/sm.md +52 -35
  28. package/pennyfarthing-dist/agents/tea.md +25 -8
  29. package/pennyfarthing-dist/agents/testing-runner.md +4 -4
  30. package/pennyfarthing-dist/commands/architect.md +0 -55
  31. package/pennyfarthing-dist/commands/dev.md +1 -54
  32. package/pennyfarthing-dist/commands/devops.md +0 -52
  33. package/pennyfarthing-dist/commands/health-check.md +33 -0
  34. package/pennyfarthing-dist/commands/orchestrator.md +0 -49
  35. package/pennyfarthing-dist/commands/pm.md +0 -53
  36. package/pennyfarthing-dist/commands/reviewer.md +1 -58
  37. package/pennyfarthing-dist/commands/sm.md +1 -64
  38. package/pennyfarthing-dist/commands/sprint.md +133 -0
  39. package/pennyfarthing-dist/commands/standalone.md +194 -0
  40. package/pennyfarthing-dist/commands/tea.md +1 -57
  41. package/pennyfarthing-dist/commands/tech-writer.md +0 -46
  42. package/pennyfarthing-dist/commands/theme-maker.md +10 -5
  43. package/pennyfarthing-dist/commands/ux-designer.md +0 -55
  44. package/pennyfarthing-dist/guides/XML-TAGS.md +156 -0
  45. package/pennyfarthing-dist/guides/agent-behavior.md +64 -38
  46. package/pennyfarthing-dist/guides/measurement-framework.md +210 -0
  47. package/pennyfarthing-dist/personas/themes/a-team.yaml +130 -0
  48. package/pennyfarthing-dist/personas/themes/alice-in-wonderland.yaml +1 -1
  49. package/pennyfarthing-dist/personas/themes/ancient-strategists.yaml +1 -1
  50. package/pennyfarthing-dist/personas/themes/arcane.yaml +1 -1
  51. package/pennyfarthing-dist/personas/themes/better-call-saul.yaml +1 -1
  52. package/pennyfarthing-dist/personas/themes/big-lebowski.yaml +1 -1
  53. package/pennyfarthing-dist/personas/themes/black-sails.yaml +1 -1
  54. package/pennyfarthing-dist/personas/themes/blade-runner.yaml +1 -1
  55. package/pennyfarthing-dist/personas/themes/bobiverse.yaml +1 -1
  56. package/pennyfarthing-dist/personas/themes/breaking-bad.yaml +1 -1
  57. package/pennyfarthing-dist/personas/themes/count-of-monte-cristo.yaml +1 -1
  58. package/pennyfarthing-dist/personas/themes/cowboy-bebop.yaml +1 -1
  59. package/pennyfarthing-dist/personas/themes/deadwood.yaml +1 -1
  60. package/pennyfarthing-dist/personas/themes/dickens.yaml +1 -1
  61. package/pennyfarthing-dist/personas/themes/discworld.yaml +1 -1
  62. package/pennyfarthing-dist/personas/themes/doctor-who.yaml +1 -1
  63. package/pennyfarthing-dist/personas/themes/don-quixote.yaml +1 -1
  64. package/pennyfarthing-dist/personas/themes/dune.yaml +1 -1
  65. package/pennyfarthing-dist/personas/themes/enlightenment-thinkers.yaml +1 -1
  66. package/pennyfarthing-dist/personas/themes/expeditionary-force.yaml +1 -1
  67. package/pennyfarthing-dist/personas/themes/futurama.yaml +1 -1
  68. package/pennyfarthing-dist/personas/themes/game-of-thrones.yaml +1 -1
  69. package/pennyfarthing-dist/personas/themes/gilligans-island.yaml +131 -1
  70. package/pennyfarthing-dist/personas/themes/gothic-literature.yaml +1 -1
  71. package/pennyfarthing-dist/personas/themes/great-gatsby.yaml +1 -1
  72. package/pennyfarthing-dist/personas/themes/hannibal.yaml +1 -1
  73. package/pennyfarthing-dist/personas/themes/harry-potter.yaml +1 -1
  74. package/pennyfarthing-dist/personas/themes/his-dark-materials.yaml +1 -1
  75. package/pennyfarthing-dist/personas/themes/inspector-morse.yaml +1 -1
  76. package/pennyfarthing-dist/personas/themes/jane-austen.yaml +1 -1
  77. package/pennyfarthing-dist/personas/themes/legion-of-doom.yaml +130 -0
  78. package/pennyfarthing-dist/personas/themes/mad-max.yaml +1 -1
  79. package/pennyfarthing-dist/personas/themes/moby-dick.yaml +1 -1
  80. package/pennyfarthing-dist/personas/themes/neuromancer.yaml +1 -1
  81. package/pennyfarthing-dist/personas/themes/parks-and-rec.yaml +130 -0
  82. package/pennyfarthing-dist/personas/themes/princess-bride.yaml +130 -0
  83. package/pennyfarthing-dist/personas/themes/renaissance-masters.yaml +1 -1
  84. package/pennyfarthing-dist/personas/themes/russian-masters.yaml +1 -1
  85. package/pennyfarthing-dist/personas/themes/sandman.yaml +1 -1
  86. package/pennyfarthing-dist/personas/themes/scientific-revolutionaries.yaml +1 -1
  87. package/pennyfarthing-dist/personas/themes/shakespeare.yaml +1 -1
  88. package/pennyfarthing-dist/personas/themes/star-trek-tng.yaml +139 -3
  89. package/pennyfarthing-dist/personas/themes/star-trek-tos.yaml +124 -0
  90. package/pennyfarthing-dist/personas/themes/star-wars.yaml +1 -1
  91. package/pennyfarthing-dist/personas/themes/succession.yaml +1 -1
  92. package/pennyfarthing-dist/personas/themes/superfriends.yaml +131 -1
  93. package/pennyfarthing-dist/personas/themes/ted-lasso.yaml +131 -1
  94. package/pennyfarthing-dist/personas/themes/the-americans.yaml +1 -1
  95. package/pennyfarthing-dist/personas/themes/the-expanse.yaml +131 -1
  96. package/pennyfarthing-dist/personas/themes/the-good-place.yaml +1 -1
  97. package/pennyfarthing-dist/personas/themes/the-matrix.yaml +1 -1
  98. package/pennyfarthing-dist/personas/themes/the-sopranos.yaml +1 -1
  99. package/pennyfarthing-dist/personas/themes/west-wing.yaml +6 -6
  100. package/pennyfarthing-dist/personas/themes/world-explorers.yaml +1 -1
  101. package/pennyfarthing-dist/personas/themes/wwii-leaders.yaml +1 -1
  102. package/pennyfarthing-dist/scripts/core/check-context.sh +23 -6
  103. package/pennyfarthing-dist/scripts/core/phase-check-start.sh +95 -0
  104. package/pennyfarthing-dist/scripts/git/release.sh +3 -2
  105. package/pennyfarthing-dist/scripts/health/drift-detection.sh +162 -0
  106. package/pennyfarthing-dist/scripts/hooks/bell-mode-hook.sh +87 -0
  107. package/pennyfarthing-dist/scripts/jira/create-jira-epic.sh +1 -1
  108. package/pennyfarthing-dist/scripts/misc/deploy.sh +1 -1
  109. package/pennyfarthing-dist/scripts/misc/statusline.sh +25 -32
  110. package/pennyfarthing-dist/scripts/sprint/import-epic-to-future.mjs +377 -0
  111. package/pennyfarthing-dist/scripts/sprint/import-epic-to-future.sh +9 -0
  112. package/pennyfarthing-dist/scripts/theme/compute-theme-tiers.js +492 -0
  113. package/pennyfarthing-dist/scripts/theme/compute-theme-tiers.sh +8 -200
  114. package/pennyfarthing-dist/scripts/workflow/list-workflows.sh +38 -5
  115. package/pennyfarthing-dist/scripts/workflow/phase-owner.sh +40 -0
  116. package/pennyfarthing-dist/skills/theme-creation/SKILL.md +12 -7
  117. package/pennyfarthing-dist/workflows/epics-and-stories/steps/step-04-final-validation.md +11 -3
  118. package/pennyfarthing-dist/workflows/epics-and-stories/steps/step-05-import-to-future.md +122 -0
  119. package/pennyfarthing-dist/workflows/epics-and-stories/workflow.yaml +3 -2
  120. package/packages/core/dist/workflow/generic-handoff.d.ts +0 -281
  121. package/packages/core/dist/workflow/generic-handoff.d.ts.map +0 -1
  122. package/packages/core/dist/workflow/generic-handoff.js +0 -411
  123. package/packages/core/dist/workflow/generic-handoff.js.map +0 -1
  124. package/packages/core/dist/workflow/generic-handoff.test.d.ts +0 -21
  125. package/packages/core/dist/workflow/generic-handoff.test.d.ts.map +0 -1
  126. package/packages/core/dist/workflow/generic-handoff.test.js +0 -499
  127. package/packages/core/dist/workflow/generic-handoff.test.js.map +0 -1
package/pennyfarthing-dist/guides/XML-TAGS.md
@@ -0,0 +1,156 @@
1
+ # XML Tag Taxonomy
2
+
3
+ Pennyfarthing uses XML-style tags to structure agent definitions and skill documentation. These tags help LLMs identify and prioritize different types of content.
4
+
5
+ ## Priority Tags
6
+
7
+ Tags that affect LLM behavior and attention.
8
+
9
+ ### `<critical>`
10
+
11
+ **Purpose:** Non-negotiable rules that MUST be followed. LLMs should treat these as hard constraints.
12
+
13
+ **Usage:** Gates, invariants, protocol requirements, things that break the system if ignored.
14
+
15
+ ```markdown
16
+ <critical>
17
+ **Never edit sprint YAML directly.** Use scripts.
18
+ </critical>
19
+ ```
20
+
21
+ **Examples:**
22
+ - "Subagent output is NOT visible to Cyclist"
23
+ - "NEVER mark acceptance criteria as complete" (for subagents)
24
+ - "Write assessment BEFORE spawning handoff subagent"
25
+
26
+ ### `<gate>`
27
+
28
+ **Purpose:** Prerequisites that MUST be verified before proceeding. Checklist-style validation.
29
+
30
+ **Usage:** Entry/exit conditions for workflows, handoff requirements, quality gates.
31
+
32
+ ```markdown
33
+ <gate>
34
+ ## Handoff Checklist
35
+ 1. Session file exists
36
+ 2. Acceptance criteria defined
37
+ 3. Feature branches created
38
+ </gate>
39
+ ```
40
+
41
+ **Difference from `<critical>`:** Gates are procedural checkpoints; critical items are invariant rules.
42
+
43
+ ### `<info>`
44
+
45
+ **Purpose:** Contextual information that helps but doesn't constrain. Reference material.
46
+
47
+ **Usage:** Background context, defaults, file locations, tips.
48
+
49
+ ```markdown
50
+ <info>
51
+ **Workflow:** SM → TEA → Dev → Reviewer → SM
52
+ **Skills:** `/sprint`, `/jira`, `/testing`
53
+ </info>
54
+ ```
55
+
56
+ ## Identity Tags
57
+
58
+ Tags that define agent personality and role.
59
+
60
+ ### `<persona>`
61
+
62
+ **Purpose:** Character personality from the active theme. Loaded at agent activation.
63
+
64
+ **Usage:** Top of agent files, sets tone and style.
65
+
66
+ ```markdown
67
+ <persona>
68
+ Auto-loaded by `agent-session.sh start` from theme config.
69
+ **Fallback if not loaded:** Supportive, methodical, detail-oriented
70
+ </persona>
71
+ ```
72
+
73
+ ### `<role>`
74
+
75
+ **Purpose:** Agent's position in the workflow and primary responsibility.
76
+
77
+ **Usage:** Brief statement of what the agent does and when it's invoked.
78
+
79
+ ```markdown
80
+ <role>
81
+ Test specification, RED phase execution, handoff to Dev
82
+ </role>
83
+ ```
84
+
85
+ ## Structure Tags
86
+
87
+ Tags that organize agent content.
88
+
89
+ ### `<helpers>`
90
+
91
+ **Purpose:** Describes Haiku subagents and their invocation pattern.
92
+
93
+ **Usage:** Lists subagents, their purposes, and how to spawn them.
94
+
95
+ ### `<responsibilities>`
96
+
97
+ **Purpose:** Bullet list of what this agent does vs delegates.
98
+
99
+ ### `<skills>`
100
+
101
+ **Purpose:** Slash commands this agent commonly uses.
102
+
103
+ ### `<context>`
104
+
105
+ **Purpose:** Guide files and sidecars to reference.
106
+
107
+ ### `<reasoning-mode>`
108
+
109
+ **Purpose:** Verbose/quiet toggle for showing thought process.
110
+
111
+ ### `<on-activation>`
112
+
113
+ **Purpose:** Startup checklist - what to do when agent is invoked.
114
+
115
+ ### `<exit>`
116
+
117
+ **Purpose:** How to leave agent mode and cleanup.
118
+
119
+ ## Usage Guidelines
120
+
121
+ 1. **`<critical>` sparingly** - If everything is critical, nothing is. Reserve for true invariants.
122
+
123
+ 2. **`<gate>` for checkpoints** - Use when there's a clear pass/fail condition.
124
+
125
+ 3. **`<info>` generously** - Helpful context improves agent performance.
126
+
127
+ 4. **Order matters:**
128
+ ```
129
+ <persona> # Who am I?
130
+ <role> # What do I do?
131
+ <helpers> # Who helps me?
132
+ <critical> # What must I never violate?
133
+ <gate> # What must I check?
134
+ <info> # What's helpful to know?
135
+ ```
136
+
137
+ 5. **Close your tags** - Always use `</tag>` even though markdown parsers are lenient.
138
+
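The "close your tags" guideline can be checked mechanically. The following is a minimal sketch, not part of the package: `KNOWN_TAGS` and `find_unclosed_tags` are hypothetical names, the tag set is copied from the taxonomy in this guide, and fenced code blocks and inline code spans are skipped on the assumption that tags shown there are examples rather than structure.

```python
import re

# Tag names taken from the taxonomy in this guide (assumed complete).
KNOWN_TAGS = {
    "critical", "gate", "info", "persona", "role", "helpers",
    "responsibilities", "skills", "context", "reasoning-mode",
    "on-activation", "exit",
}

TAG_RE = re.compile(r"<(/?)([a-z][a-z-]*)>")
FENCE_RE = re.compile(r"^\s*(```|~~~)")


def find_unclosed_tags(text: str) -> list[str]:
    """Return a list of problems: unexpected closing tags and unclosed tags."""
    problems = []
    stack = []  # (tag name, line number) for every tag still open
    in_fence = False
    for line_no, line in enumerate(text.splitlines(), start=1):
        if FENCE_RE.match(line):
            in_fence = not in_fence  # entering or leaving a fenced block
            continue
        if in_fence:
            continue
        # Drop inline code spans so `<critical>` in prose is not counted.
        line = re.sub(r"`[^`]*`", "", line)
        for closing, name in TAG_RE.findall(line):
            if name not in KNOWN_TAGS:
                continue
            if not closing:
                stack.append((name, line_no))
            elif stack and stack[-1][0] == name:
                stack.pop()
            else:
                problems.append(f"line {line_no}: unexpected </{name}>")
    problems.extend(f"line {n}: <{name}> is never closed" for name, n in stack)
    return problems


if __name__ == "__main__":
    sample = "<gate>\n1. Session file exists\n"
    print(find_unclosed_tags(sample))
```

A check like this could run over agent and skill files before release; it only reports problems and does not rewrite anything.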
139
+ ## Tag Locations
140
+
141
+ | Tag | Typical Location |
142
+ |-----|------------------|
143
+ | `<critical>` | Agent files, skill files, workflow instructions |
144
+ | `<gate>` | Subagent files (handoff, finish, setup) |
145
+ | `<info>` | Agent files, guide files |
146
+ | `<persona>` | Agent files (top) |
147
+ | `<role>` | Agent files (after persona) |
148
+
149
+ ## Adding New Tags
150
+
151
+ Before adding a new tag type:
152
+
153
+ 1. Check if existing tags cover the use case
154
+ 2. Document the tag's purpose and priority level
155
+ 3. Update this file
156
+ 4. Be consistent across all files using the tag
package/pennyfarthing-dist/guides/agent-behavior.md
@@ -7,7 +7,7 @@
7
7
  ## Critical Protocols
8
8
 
9
9
  <critical>
10
- **Reflector markers:** Subagent output is NOT visible to Cyclist. After handoff subagent returns `AGENT_COMMAND`, output the `marker` string verbatim. See `<agent-command-protocol>` below.
10
+ **Reflector markers:** Run `handoff-marker.sh {next_agent}` as your ABSOLUTE LAST ACTION and output the result verbatim. See `<agent-exit-protocol>` below.
11
11
  </critical>
12
12
 
13
13
  <critical>
@@ -20,7 +20,7 @@ Multi-repo: `cd $CLAUDE_PROJECT_DIR/$(get_repo_path "$repo")` after sourcing `sc
20
20
  </critical>
21
21
 
22
22
  <critical>
23
- **Handoff Action:** When `handoff` returns `AGENT_COMMAND`, output `marker` verbatim then `fallback`. Don't ask permission.
23
+ **Handoff Action:** After handoff subagent returns, run `handoff-marker.sh` as LAST action, output result, EXIT.
24
24
  </critical>
25
25
 
26
26
  <critical>
@@ -177,62 +177,88 @@ HTML comments that agents emit to signal Cyclist UI. Format: `<!-- CYCLIST:TYPE:
177
177
 
178
178
  ---
179
179
 
180
- <agent-command-protocol>
181
- ## AGENT_COMMAND Protocol
180
+ <agent-exit-protocol>
181
+ ## Agent Exit Protocol
182
182
 
183
- <critical>
184
- **Subagent output is NOT visible to Cyclist.** Tool results are not parsed for markers.
185
- Handoff subagents return an `AGENT_COMMAND` block with a pre-rendered `marker` string.
186
- The **calling agent** outputs the `marker` verbatim - no parsing or mapping required.
187
- </critical>
183
+ ### Exit Sequence
188
184
 
189
- ### How It Works
185
+ 1. Write assessment to session file
186
+ 2. Spawn `handoff` subagent
187
+ 3. Await `HANDOFF_RESULT`
188
+ 4. If `status: blocked` → report error, stop
189
+ 5. **Run this as ABSOLUTE LAST ACTION:**
190
+ ```bash
191
+ $CLAUDE_PROJECT_DIR/.pennyfarthing/scripts/core/handoff-marker.sh {next_agent}
192
+ ```
193
+ 6. **Output the script result verbatim and EXIT**
190
194
 
191
- 1. Agent writes assessment to session file FIRST
192
- 2. Agent spawns `handoff` subagent
193
- 3. Subagent runs `handoff-marker.sh {next-agent}` to generate `AGENT_COMMAND` block
194
- 4. Subagent returns the block with pre-rendered `marker` string
195
- 5. **Agent outputs `marker` verbatim, then outputs `fallback` message**
195
+ ### HANDOFF_RESULT Format
196
196
 
197
- **Single Source of Truth:** The `handoff-marker.sh` script is the authoritative source for marker format. It handles environment detection (IS_CYCLIST, USE_TIREPUMP) automatically.
197
+ ```
198
+ HANDOFF_RESULT:
199
+ status: success|blocked
200
+ next_agent: {agent_name}
201
+ error: "{message}" # if blocked
202
+ ```
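To make the `HANDOFF_RESULT` format concrete, here is a minimal parsing sketch. It is an illustration only: `parse_handoff_result` and `should_emit_marker` are hypothetical helpers, and the sketch assumes the block is plain `key: value` lines with optional double quotes around values, as in the format shown above.

```python
def parse_handoff_result(text: str) -> dict[str, str]:
    """Parse a HANDOFF_RESULT block into a dict of its fields."""
    fields = {}
    seen_header = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "HANDOFF_RESULT:":
            seen_header = True
            continue
        if not seen_header or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]  # drop surrounding double quotes
        fields[key.strip()] = value
    if not seen_header:
        raise ValueError("no HANDOFF_RESULT header found")
    if fields.get("status") not in ("success", "blocked"):
        raise ValueError(f"unexpected status: {fields.get('status')!r}")
    return fields


def should_emit_marker(result: dict[str, str]) -> bool:
    """Step 4 of the exit sequence: a blocked handoff stops before the marker."""
    return result["status"] == "success"


if __name__ == "__main__":
    blocked = parse_handoff_result(
        'HANDOFF_RESULT:\n  status: blocked\n  next_agent: dev\n  error: "tests failing"\n'
    )
    print(blocked["error"], should_emit_marker(blocked))
```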
198
203
 
199
- ### AGENT_COMMAND Format
204
+ ### Script Output (emit verbatim)
200
205
 
201
206
  ```
202
207
  ---
203
208
  AGENT_COMMAND:
204
- marker: "{PRE_RENDERED_MARKER_STRING}"
205
- fallback: "Run `/{agent}` to continue"
209
+ marker: "<!-- CYCLIST:HANDOFF:/dev -->"
210
+ fallback: "Run `/dev` to continue"
206
211
  ---
207
212
  ```
208
213
 
209
- The `marker` field contains the exact string to output (or empty string if no marker needed).
210
- The `fallback` field contains human-readable instructions.
214
+ **Nothing after the marker. EXIT.**
215
+ </agent-exit-protocol>
211
216
 
212
- ### Agent Action
217
+ <wrong-phase-detection>
218
+ ## Wrong Phase Detection
213
219
 
214
- **Simple rule: Output `marker` then `fallback`. That's it.**
220
+ When an agent detects the story is NOT in their phase, emit a marker immediately.
215
221
 
216
- 1. If `error: true` → Report the `fallback` message as an error
217
- 2. Otherwise → Output `marker` verbatim (if non-empty), then output `fallback`
222
+ ### How to Check (works with custom workflows)
218
223
 
219
- ### Example
224
+ 1. Read `**Workflow:**` and `**Phase:**` from session file
225
+ 2. Query the phase owner:
226
+ ```bash
227
+ OWNER=$($CLAUDE_PROJECT_DIR/.pennyfarthing/scripts/core/run.sh workflow/phase-owner.sh {workflow} {phase})
228
+ ```
229
+ 3. If `$OWNER` != your agent name → story belongs to another agent
220
230
 
221
- Subagent returns:
222
- ```
223
- ---
224
- AGENT_COMMAND:
225
- marker: "<!-- CYCLIST:HANDOFF:/dev -->"
226
- fallback: "Run `/dev` to continue"
227
- ---
231
+ ### Action When Not Your Phase
232
+
233
+ ```bash
234
+ $CLAUDE_PROJECT_DIR/.pennyfarthing/scripts/core/handoff-marker.sh {OWNER}
228
235
  ```
229
236
 
230
- Agent outputs in their direct text (not a tool call):
237
+ Then output the result verbatim. This triggers Cyclist's handoff button.
238
+
239
+ ### Example
240
+
241
+ Dev reads session: `**Workflow:** tdd`, `**Phase:** review`
242
+
243
+ ```bash
244
+ OWNER=$($CLAUDE_PROJECT_DIR/.pennyfarthing/scripts/core/run.sh workflow/phase-owner.sh tdd review)
245
+ # Returns: reviewer
231
246
  ```
232
- <!-- CYCLIST:HANDOFF:/dev -->
233
247
 
234
- Run `/dev` to continue
248
+ Since "reviewer" != "dev", Dev runs:
249
+ ```bash
250
+ $CLAUDE_PROJECT_DIR/.pennyfarthing/scripts/core/handoff-marker.sh reviewer
235
251
  ```
236
252
 
237
- **CRITICAL:** The marker MUST appear in the agent's direct text output, not in a tool result.
238
- </agent-command-protocol>
253
+ ### Do NOT just say "run /reviewer"
254
+
255
+ Wrong:
256
+ > The story is in review. Run `/reviewer` to continue.
257
+
258
+ Right:
259
+ > The story is in review phase.
260
+ >
261
+ > <!-- CYCLIST:HANDOFF:/reviewer -->
262
+ >
263
+ > Run `/reviewer` to continue
264
+ </wrong-phase-detection>
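The wrong-phase check reduces to a small decision function. The sketch below is illustrative and rests on assumptions: `read_workflow_and_phase` and `handoff_target` are hypothetical names, the session file is assumed to contain `**Workflow:** <name>` and `**Phase:** <name>` as in the example, and the owner lookup is injected as a callable so the logic can be exercised without shelling out to `phase-owner.sh`.

```python
import re
from typing import Callable, Optional

FIELD_RE = re.compile(r"\*\*(Workflow|Phase):\*\*\s*(\S+)")


def read_workflow_and_phase(session_text: str) -> tuple[str, str]:
    """Extract the workflow and phase names from session file text."""
    fields = {name: value for name, value in FIELD_RE.findall(session_text)}
    missing = [name for name in ("Workflow", "Phase") if name not in fields]
    if missing:
        raise ValueError(f"session file is missing: {', '.join(missing)}")
    return fields["Workflow"], fields["Phase"]


def handoff_target(
    session_text: str,
    agent: str,
    phase_owner: Callable[[str, str], str],
) -> Optional[str]:
    """Return the agent to hand off to, or None if the phase is ours."""
    workflow, phase = read_workflow_and_phase(session_text)
    owner = phase_owner(workflow, phase)
    return None if owner == agent else owner


if __name__ == "__main__":
    # Stand-in for the real lookup; the mapping here is made up for the demo.
    owners = {("tdd", "review"): "reviewer"}
    session = "**Workflow:** tdd\n**Phase:** review\n"
    print(handoff_target(session, "dev", lambda w, p: owners[(w, p)]))
```

When the function returns an agent name, the caller would still run `handoff-marker.sh` with that name and emit its output verbatim; the sketch deliberately does not build the marker string itself.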
package/pennyfarthing-dist/guides/measurement-framework.md
@@ -0,0 +1,210 @@
1
+ # Measurement Framework for AI Evaluation
2
+
3
+ > Based on: Wallach et al. (2025). "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge." ICML 2025.
4
+ >
5
+ > Paper: https://arxiv.org/abs/2502.00561
6
+
7
+ ## Overview
8
+
9
+ This guide establishes Pennyfarthing's approach to rigorous AI evaluation, grounded in social science measurement theory. The core insight: **evaluating AI systems is fundamentally a measurement challenge**, not merely a technical benchmarking exercise.
10
+
11
+ Current AI evaluation practices—single-metric benchmarks, leaderboard rankings—fail to capture the complexity of what we're actually trying to measure. Proper evaluation requires the same methodological rigor social scientists apply to measuring abstract constructs like intelligence, fairness, or trust.
12
+
13
+ ## The Four-Level Measurement Framework
14
+
15
+ All Pennyfarthing evaluation should distinguish between these four levels:
16
+
17
+ ```
18
+ Level 1: Background Concept
19
+
20
+ Level 2: Systematized Concept
21
+
22
+ Level 3: Operationalization
23
+
24
+ Level 4: Scores/Indicators
25
+ ```
26
+
27
+ ### Level 1: Background Concept
28
+
29
+ The broad, often contested idea we care about.
30
+
31
+ **Examples in Pennyfarthing:**
32
+ - "Code review quality"
33
+ - "Test effectiveness"
34
+ - "Agent helpfulness"
35
+ - "Persona consistency"
36
+
37
+ **Characteristics:**
38
+ - Often vague or contested
39
+ - Multiple valid interpretations exist
40
+ - May mean different things to different stakeholders
41
+
42
+ ### Level 2: Systematized Concept
43
+
44
+ A more precise definition scoped to the evaluation context.
45
+
46
+ **Example:** "Code review quality" → "The degree to which a code review identifies genuine issues, provides actionable feedback, and avoids false positives, within a TypeScript codebase context."
47
+
48
+ **Requirements:**
49
+ - Explicit scope boundaries
50
+ - Stated assumptions
51
+ - Clear relationship to Level 1 concept
52
+
53
+ ### Level 3: Operationalization
54
+
55
+ The concrete procedure for measurement.
56
+
57
+ **Examples:**
58
+ - A benchmark scenario with known issues
59
+ - A judge prompt with scoring rubric
60
+ - A red-teaming protocol
61
+ - Human annotation guidelines
62
+
63
+ **Key questions:**
64
+ - Does this operationalization actually capture the Level 2 concept?
65
+ - What aspects of the concept does it miss?
66
+ - What confounding factors might it introduce?
67
+
68
+ ### Level 4: Scores/Indicators
69
+
70
+ The numeric outputs from applying the operationalization.
71
+
72
+ **Examples:**
73
+ - Detection score: 85/100
74
+ - Precision: 0.89
75
+ - Recall: 0.80
76
+ - Krippendorff's Alpha: 0.72
77
+
78
+ **Critical insight:** A score is only meaningful in relation to Levels 1-3. Without understanding what construct a benchmark measures, scores are uninterpretable.
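As a worked example of Level 4 indicators, precision and recall for an issue-detection scenario can be computed from two sets: the issues an agent reported and the issues known to be planted. This is a sketch with hypothetical names and made-up issue IDs, not the benchmark's actual scoring code; an empty set is scored as 0.0 here purely as a convention of the sketch.

```python
def precision_recall(reported: set[str], actual: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for reported issues against known issues."""
    true_positives = len(reported & actual)
    # With nothing reported (or nothing planted) the ratio is undefined;
    # this sketch returns 0.0 in that case.
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    actual = {f"issue-{i}" for i in range(1, 11)}               # 10 planted issues
    reported = {f"issue-{i}" for i in range(1, 9)} | {"style-nit"}  # 8 hits, 1 false positive
    p, r = precision_recall(reported, actual)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.89 recall=0.80
```

The two numbers answer different questions (how much of what was flagged is real, and how much of what is real was flagged), which is why they are reported separately rather than folded into one composite.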
79
+
80
+ ## Applying the Framework to Pennyfarthing
81
+
82
+ ### Judge Evaluation
83
+
84
+ | Level | Current State | Target State |
85
+ |-------|---------------|--------------|
86
+ | 1. Background Concept | "Agent quality" (vague) | Explicit decomposition into sub-constructs |
87
+ | 2. Systematized Concept | Implicit in rubric | Documented construct definitions |
88
+ | 3. Operationalization | Checklist rubric | Anchored rubric with behavioral examples |
89
+ | 4. Scores | 0-100 composite | Precision/recall + quality dimensions |
90
+
91
+ ### Benchmark Scenarios
92
+
93
+ | Level | Question to Answer |
94
+ |-------|-------------------|
95
+ | 1 | What broad capability are we trying to measure? |
96
+ | 2 | How do we scope that capability for this scenario type? |
97
+ | 3 | Does our scenario design actually test that capability? |
98
+ | 4 | What do the resulting scores tell us (and not tell us)? |
99
+
100
+ ## Validity Evidence
101
+
102
+ A valid evaluation requires multiple forms of evidence:
103
+
104
+ ### Content Validity
105
+ Does the operationalization cover the construct adequately?
106
+
107
+ **For Pennyfarthing benchmarks:**
108
+ - Do scenarios cover the range of situations the construct applies to?
109
+ - Are there important aspects of the construct not represented?
110
+
111
+ ### Construct Validity
112
+ Does the operationalization measure what it claims to measure?
113
+
114
+ **Tests:**
115
+ - Do scores correlate with other measures of the same construct?
116
+ - Do scores NOT correlate with measures of different constructs?
117
+ - Do known-good agents score higher than known-poor agents?
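The known-groups test above can be quantified in a simple way: the fraction of (known-good, known-poor) pairs in which the known-good agent scores higher, with ties counted as half. The function name and the scores below are illustrative assumptions, not Pennyfarthing data.

```python
from itertools import product


def known_groups_separation(good_scores: list[float], poor_scores: list[float]) -> float:
    """Fraction of good/poor pairs where the good agent wins (ties count 0.5)."""
    if not good_scores or not poor_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for good, poor in product(good_scores, poor_scores):
        if good > poor:
            wins += 1.0
        elif good == poor:
            wins += 0.5
    return wins / (len(good_scores) * len(poor_scores))


if __name__ == "__main__":
    # Made-up scores: 8 of the 9 pairs favour the known-good group.
    print(known_groups_separation([85, 90, 78], [60, 80, 55]))
```

A value near 1.0 is consistent with the benchmark separating the groups; a value near 0.5 suggests it cannot tell them apart.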
118
+
119
+ ### Criterion Validity
120
+ Do scores predict real-world outcomes?
121
+
122
+ **For Pennyfarthing:**
123
+ - Do high code-review scores predict fewer bugs in production?
124
+ - Do high test-writing scores predict better test coverage?
125
+
126
+ ### Reliability
127
+ Does the measurement produce consistent results?
128
+
129
+ **Metrics:**
130
+ - Inter-rater reliability (Krippendorff's Alpha for multi-judge)
131
+ - Test-retest reliability (same agent, same scenario, different runs)
132
+ - Internal consistency (do related items correlate?)
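For the inter-rater metric, the sketch below computes Krippendorff's Alpha for nominal labels from the per-item label counts. It is a minimal illustration with hypothetical names, covering only the nominal case; ordinal or interval scores would need a different distance function.

```python
from collections import Counter


def krippendorff_alpha_nominal(units: list[list[str]]) -> float:
    """Krippendorff's alpha for nominal labels.

    `units` holds one list of labels per rated item, one label per judge
    that rated it. Items with fewer than two labels are skipped because
    they carry no agreement information.
    """
    pairable = [Counter(labels) for labels in units if len(labels) >= 2]
    totals = Counter()
    for counts in pairable:
        totals.update(counts)
    n = sum(totals.values())
    if n == 0:
        raise ValueError("need at least one item with two or more labels")

    # Observed disagreement: mismatching ordered label pairs within each
    # item, weighted by 1 / (m - 1) where m is the item's label count.
    observed = 0.0
    for counts in pairable:
        m = sum(counts.values())
        mismatched = m * m - sum(c * c for c in counts.values())
        observed += mismatched / (m - 1)
    observed /= n

    # Expected disagreement if all labels were paired at random.
    expected = (n * n - sum(c * c for c in totals.values())) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1.0 - observed / expected


if __name__ == "__main__":
    judge_a = list("0100000010")
    judge_b = list("1110010000")
    units = [[a, b] for a, b in zip(judge_a, judge_b)]
    print(round(krippendorff_alpha_nominal(units), 3))  # 0.095
```

In the demo the two judges agree on 6 of 10 items, yet alpha is only about 0.095, because most of that agreement is what chance alone would produce given how often each label is used.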
133
+
134
+ ## Anti-Patterns to Avoid
135
+
136
+ ### 1. Jumping to Level 4
137
+ **Problem:** Designing a benchmark without defining what it measures.
138
+ **Symptom:** "We have a score, but we're not sure what it means."
139
+ **Fix:** Start with Level 1-2 before designing operationalization.
140
+
141
+ ### 2. Conflating Operationalization with Construct
142
+ **Problem:** Treating the benchmark as the definition of quality.
143
+ **Symptom:** "A good agent is one that scores high on our benchmark."
144
+ **Fix:** Acknowledge benchmarks are imperfect proxies. Use multiple operationalizations.
145
+
146
+ ### 3. Ignoring Annotator Disagreement
147
+ **Problem:** Averaging away disagreement as "noise."
148
+ **Symptom:** Low Krippendorff's Alpha treated as measurement error.
149
+ **Fix:** Disagreement is signal about construct complexity. Investigate, don't suppress.
150
+
151
+ ### 4. Over-indexing on Single Metrics
152
+ **Problem:** Optimizing for one number.
153
+ **Symptom:** Agents that game benchmarks but fail in real use.
154
+ **Fix:** Use multiple metrics, understand what each measures.
155
+
156
+ ## Implementation in Pennyfarthing Benchmarks
157
+
158
+ ### Scenario Design Checklist
159
+
160
+ Before creating a new benchmark scenario:
161
+
162
+ - [ ] **Level 1 defined:** What broad concept does this scenario test?
163
+ - [ ] **Level 2 documented:** How is that concept scoped for this scenario?
164
+ - [ ] **Validity argument:** Why does this scenario test the claimed construct?
165
+ - [ ] **Known limitations:** What aspects of the construct does this NOT test?
166
+ - [ ] **Baseline established:** What is the expected performance range?
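One way to keep this checklist from being skipped is to store its answers next to the scenario and flag anything left blank. The sketch below assumes a hypothetical `ScenarioSpec` record; the field names are invented for illustration and do not reflect an existing Pennyfarthing schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScenarioSpec:
    """Hypothetical metadata record for one benchmark scenario."""
    name: str
    background_concept: str = ""      # Level 1
    systematized_concept: str = ""    # Level 2
    validity_argument: str = ""
    known_limitations: list[str] = field(default_factory=list)
    baseline_range: Optional[tuple[float, float]] = None


def unmet_checklist_items(spec: ScenarioSpec) -> list[str]:
    """Return the checklist items that the spec has not yet answered."""
    checks = [
        ("Level 1 defined", bool(spec.background_concept.strip())),
        ("Level 2 documented", bool(spec.systematized_concept.strip())),
        ("Validity argument", bool(spec.validity_argument.strip())),
        ("Known limitations", bool(spec.known_limitations)),
        ("Baseline established", spec.baseline_range is not None),
    ]
    return [label for label, ok in checks if not ok]


if __name__ == "__main__":
    draft = ScenarioSpec(name="review-ts-01", background_concept="Code review quality")
    print(unmet_checklist_items(draft))
```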
167
+
168
+ ### Judge Rubric Checklist
169
+
170
+ Before using a scoring rubric:
171
+
172
+ - [ ] **Constructs explicit:** What does each dimension measure?
173
+ - [ ] **Anchors defined:** What behaviors correspond to each score level?
174
+ - [ ] **Reliability tested:** What is the inter-judge agreement?
175
+ - [ ] **Edge cases documented:** How should ambiguous situations be scored?
176
+
177
+ ### Results Interpretation Checklist
178
+
179
+ Before reporting benchmark results:
180
+
181
+ - [ ] **Context provided:** What construct was measured?
182
+ - [ ] **Limitations stated:** What does this score NOT tell us?
183
+ - [ ] **Confidence indicated:** How reliable is this measurement?
184
+ - [ ] **Comparisons valid:** Are we comparing like with like?
185
+
186
+ ## Relation to Benchmark Reliability Epics
187
+
188
+ The Benchmark Reliability initiative (epics 41-46) directly implements this framework:
189
+
190
+ | Epic | Framework Alignment |
191
+ |------|---------------------|
192
+ | 41: Precision/Recall Detection | Level 4 improvement: separate metrics for distinct constructs |
193
+ | 42: Anchored Rubric Criteria | Level 3 improvement: behavioral anchors reduce measurement variance |
194
+ | 43: False Positive Traps | Level 3 improvement: test construct validity with red herrings |
195
+ | 44: Multi-Judge Validation | Reliability evidence: measure inter-rater agreement |
196
+ | 45: Gold Standard References | Level 3 improvement: calibration anchors for consistent scoring |
197
+ | 46: Difficulty Profile | Level 2 improvement: multi-dimensional construct decomposition |
198
+
199
+ ## References
200
+
201
+ - Wallach, H., et al. (2025). Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge. ICML 2025. https://arxiv.org/abs/2502.00561
202
+ - Adcock, R., & Collier, D. (2001). Measurement validity: A shared standard for qualitative and quantitative research. APSR.
203
+ - Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin.
204
+ - Messick, S. (1995). Validity of psychological assessment. American Psychologist.
205
+ - HELM: Holistic Evaluation of Language Models. Stanford CRFM.
206
+ - ARC-AGI: A benchmark for measuring machine intelligence.
207
+
208
+ ## Changelog
209
+
210
+ - 2026-01-23: Initial version based on Wallach et al. (2025) ICML paper