npm - @pennyfarthing/core - Versions diffs - 7.4.1 → 7.6.0 - Mend

Files changed (127)

package/pennyfarthing-dist/guides/XML-TAGS.md ADDED

@@ -0,0 +1,156 @@
+# XML Tag Taxonomy
+Pennyfarthing uses XML-style tags to structure agent definitions and skill documentation. These tags help LLMs identify and prioritize different types of content.
+## Priority Tags
+Tags that affect LLM behavior and attention.
+### `<critical>`
+**Purpose:** Non-negotiable rules that MUST be followed. LLMs should treat these as hard constraints.
+**Usage:** Gates, invariants, protocol requirements, things that break the system if ignored.
+```markdown
+<critical>
+**Never edit sprint YAML directly.** Use scripts.
+</critical>
+```
+**Examples:**
+- "Subagent output is NOT visible to Cyclist"
+- "NEVER mark acceptance criteria as complete" (for subagents)
+- "Write assessment BEFORE spawning handoff subagent"
+### `<gate>`
+**Purpose:** Prerequisites that MUST be verified before proceeding. Checklist-style validation.
+**Usage:** Entry/exit conditions for workflows, handoff requirements, quality gates.
+```markdown
+<gate>
+## Handoff Checklist
+1. Session file exists
+2. Acceptance criteria defined
+3. Feature branches created
+</gate>
+```
+**Difference from `<critical>`:** Gates are procedural checkpoints; critical items are invariant rules.
+### `<info>`
+**Purpose:** Contextual information that helps but doesn't constrain. Reference material.
+**Usage:** Background context, defaults, file locations, tips.
+```markdown
+<info>
+**Workflow:** SM → TEA → Dev → Reviewer → SM
+**Skills:** `/sprint`, `/jira`, `/testing`
+</info>
+```
+## Identity Tags
+Tags that define agent personality and role.
+### `<persona>`
+**Purpose:** Character personality from the active theme. Loaded at agent activation.
+**Usage:** Top of agent files, sets tone and style.
+```markdown
+<persona>
+Auto-loaded by `agent-session.sh start` from theme config.
+**Fallback if not loaded:** Supportive, methodical, detail-oriented
+</persona>
+```
+### `<role>`
+**Purpose:** Agent's position in the workflow and primary responsibility.
+**Usage:** Brief statement of what the agent does and when it's invoked.
+```markdown
+<role>
+Test specification, RED phase execution, handoff to Dev
+</role>
+```
+## Structure Tags
+Tags that organize agent content.
+### `<helpers>`
+**Purpose:** Describes Haiku subagents and their invocation pattern.
+**Usage:** Lists subagents, their purposes, and how to spawn them.
+### `<responsibilities>`
+**Purpose:** Bullet list of what this agent does versus what it delegates.
+### `<skills>`
+**Purpose:** Slash commands this agent commonly uses.
+### `<context>`
+**Purpose:** Guide files and sidecars to reference.
+### `<reasoning-mode>`
+**Purpose:** Verbose/quiet toggle for showing thought process.
+### `<on-activation>`
+**Purpose:** Startup checklist - what to do when agent is invoked.
+### `<exit>`
+**Purpose:** How to leave agent mode and clean up.
+## Usage Guidelines
+1. **`<critical>` sparingly** - If everything is critical, nothing is. Reserve for true invariants.
+2. **`<gate>` for checkpoints** - Use when there's a clear pass/fail condition.
+3. **`<info>` generously** - Helpful context improves agent performance.
+4. **Order matters:**
+   ```
+   <persona>      # Who am I?
+   <role>         # What do I do?
+   <helpers>      # Who helps me?
+   <critical>     # What must I never violate?
+   <gate>         # What must I check?
+   <info>         # What's helpful to know?
+   ```
+5. **Close your tags** - Always use `</tag>` even though markdown parsers are lenient.
+## Tag Locations
+| Tag | Typical Location |
+|-----|------------------|
+| `<critical>` | Agent files, skill files, workflow instructions |
+| `<gate>` | Subagent files (handoff, finish, setup) |
+| `<info>` | Agent files, guide files |
+| `<persona>` | Agent files (top) |
+| `<role>` | Agent files (after persona) |
+## Adding New Tags
+Before adding a new tag type:
+1. Check if existing tags cover the use case
+2. Document the tag's purpose and priority level
+3. Update this file
+4. Be consistent across all files using the tag
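
Reviewer note (not part of the package): guideline 5 in the added XML-TAGS.md asks authors to always close their tags, but the guide does not describe any tooling for checking this. The sketch below is one hypothetical way to lint an agent or skill file. The function name `find_tag_errors` and the `KNOWN_TAGS` set are assumptions made for illustration; the set simply lists the tags documented in the guide. Tags inside fenced code blocks or inline code spans are treated as mentions and skipped.

```python
import re

# Tag names copied from the guide; extend this set when a new tag is documented.
KNOWN_TAGS = {
    "critical", "gate", "info", "persona", "role", "helpers",
    "responsibilities", "skills", "context", "reasoning-mode",
    "on-activation", "exit",
}

TAG_RE = re.compile(r"<(/?)([a-z][a-z-]*)>")
INLINE_CODE_RE = re.compile(r"`[^`\n]*`")


def find_tag_errors(text):
    """Return human-readable problems; an empty list means tags are balanced."""
    errors = []
    stack = []  # (tag name, line number) for each tag that is currently open
    in_fence = False
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # entering or leaving a fenced code block
            continue
        if in_fence:
            continue
        # A tag quoted as inline code (e.g. `<gate>`) is a mention, not structure.
        line = INLINE_CODE_RE.sub("", line)
        for match in TAG_RE.finditer(line):
            closing, name = match.group(1) == "/", match.group(2)
            if name not in KNOWN_TAGS:
                continue
            if not closing:
                stack.append((name, lineno))
            elif stack and stack[-1][0] == name:
                stack.pop()
            else:
                errors.append(f"line {lineno}: unexpected </{name}>")
    errors.extend(f"line {n}: <{name}> is never closed" for name, n in stack)
    return errors
```

Running it over each guide and agent file in CI would turn guideline 5 into a pass/fail check rather than a convention.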

package/pennyfarthing-dist/guides/agent-behavior.md CHANGED

@@ -7,7 +7,7 @@
 ## Critical Protocols
 <critical>
-**Reflector markers:** Subagent output is NOT visible to Cyclist. After handoff subagent returns `AGENT_COMMAND`, output the `marker` string verbatim. See `<agent-command-protocol>` below.
+**Reflector markers:** Run `handoff-marker.sh {next_agent}` as your ABSOLUTE LAST ACTION and output the result verbatim. See `<agent-exit-protocol>` below.
 </critical>
 <critical>
@@ -20,7 +20,7 @@ Multi-repo: `cd $CLAUDE_PROJECT_DIR/$(get_repo_path "$repo")` after sourcing `sc
 </critical>
 <critical>
-**Handoff Action:** When `handoff` returns `AGENT_COMMAND`, output `marker` verbatim then `fallback`. Don't ask permission.
+**Handoff Action:** After handoff subagent returns, run `handoff-marker.sh` as LAST action, output result, EXIT.
 </critical>
 <critical>
@@ -177,62 +177,88 @@ HTML comments that agents emit to signal Cyclist UI. Format: `<!-- CYCLIST:TYPE:
 ---
-<agent-command-protocol>
-## AGENT_COMMAND Protocol
+<agent-exit-protocol>
+## Agent Exit Protocol
-<critical>
-**Subagent output is NOT visible to Cyclist.** Tool results are not parsed for markers.
-Handoff subagents return an `AGENT_COMMAND` block with a pre-rendered `marker` string.
-The **calling agent** outputs the `marker` verbatim - no parsing or mapping required.
-</critical>
+### Exit Sequence
-### How It Works
+1. Write assessment to session file
+2. Spawn `handoff` subagent
+3. Await `HANDOFF_RESULT`
+4. If `status: blocked` → report error, stop
+5. **Run this as ABSOLUTE LAST ACTION:**
+   ```bash
+   $CLAUDE_PROJECT_DIR/.pennyfarthing/scripts/core/handoff-marker.sh {next_agent}
+   ```
+6. **Output the script result verbatim and EXIT**
-1. Agent writes assessment to session file FIRST
-2. Agent spawns `handoff` subagent
-3. Subagent runs `handoff-marker.sh {next-agent}` to generate `AGENT_COMMAND` block
-4. Subagent returns the block with pre-rendered `marker` string
-5. **Agent outputs `marker` verbatim, then outputs `fallback` message**
+### HANDOFF_RESULT Format
-**Single Source of Truth:** The `handoff-marker.sh` script is the authoritative source for marker format. It handles environment detection (IS_CYCLIST, USE_TIREPUMP) automatically.
+```
+HANDOFF_RESULT:
+  status: success|blocked
+  next_agent: {agent_name}
+  error: "{message}"  # if blocked
+```
-### AGENT_COMMAND Format
+### Script Output (emit verbatim)
 ```
 ---
 AGENT_COMMAND:
-  marker: "{PRE_RENDERED_MARKER_STRING}"
-  fallback: "Run `/{agent}` to continue"
+  marker: "<!-- CYCLIST:HANDOFF:/dev -->"
+  fallback: "Run `/dev` to continue"
 ---
 ```
-The `marker` field contains the exact string to output (or empty string if no marker needed).
-The `fallback` field contains human-readable instructions.
+**Nothing after the marker. EXIT.**
+</agent-exit-protocol>
-### Agent Action
+<wrong-phase-detection>
+## Wrong Phase Detection
-**Simple rule: Output `marker` then `fallback`. That's it.**
+When an agent detects the story is NOT in their phase, emit a marker immediately.
-1. If `error: true` → Report the `fallback` message as an error
-2. Otherwise → Output `marker` verbatim (if non-empty), then output `fallback`
+### How to Check (works with custom workflows)
-### Example
+1. Read `**Workflow:**` and `**Phase:**` from session file
+2. Query the phase owner:
+   ```bash
+   OWNER=$($CLAUDE_PROJECT_DIR/.pennyfarthing/scripts/core/run.sh workflow/phase-owner.sh {workflow} {phase})
+   ```
+3. If `$OWNER` != your agent name → story belongs to another agent
-Subagent returns:
-```
----
-AGENT_COMMAND:
-  marker: "<!-- CYCLIST:HANDOFF:/dev -->"
-  fallback: "Run `/dev` to continue"
----
+### Action When Not Your Phase
+```bash
+$CLAUDE_PROJECT_DIR/.pennyfarthing/scripts/core/handoff-marker.sh {OWNER}
 ```
-Agent outputs in their direct text (not a tool call):
+Then output the result verbatim. This triggers Cyclist's handoff button.
+### Example
+Dev reads session: `**Workflow:** tdd`, `**Phase:** review`
+```bash
+OWNER=$($CLAUDE_PROJECT_DIR/.pennyfarthing/scripts/core/run.sh workflow/phase-owner.sh tdd review)
+# Returns: reviewer
 ```
-<!-- CYCLIST:HANDOFF:/dev -->
-Run `/dev` to continue
+Since "reviewer" != "dev", Dev runs:
+```bash
+$CLAUDE_PROJECT_DIR/.pennyfarthing/scripts/core/handoff-marker.sh reviewer
 ```
-**CRITICAL:** The marker MUST appear in the agent's direct text output, not in a tool result.
-</agent-command-protocol>
+### Do NOT just say "run /reviewer"
+Wrong:
+> The story is in review. Run `/reviewer` to continue.
+Right:
+> The story is in review phase.
+>
+> <!-- CYCLIST:HANDOFF:/reviewer -->
+>
+> Run `/reviewer` to continue
+</wrong-phase-detection>
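
Reviewer note (not part of the package): the new exit protocol and wrong-phase check both reduce to a small decision, which the sketch below models in Python so the branches are easy to see. It assumes the `HANDOFF_RESULT` block arrives as plain text in the format shown in the diff. The function names `parse_handoff_result`, `exit_action`, and `wrong_phase_target` are hypothetical, and treating any status other than `success` as blocked is an assumption; the guide only specifies the `blocked` case. The sketch deliberately does not call `handoff-marker.sh` or `phase-owner.sh`; it only decides which agent name would be passed to the marker script.

```python
from typing import Optional


def parse_handoff_result(text):
    """Collect the indented `key: value` lines that follow HANDOFF_RESULT."""
    fields = {}
    in_block = False
    for line in text.splitlines():
        if line.strip() == "HANDOFF_RESULT:":
            in_block = True
            continue
        if in_block:
            if not line.startswith(" ") or ":" not in line:
                break  # the block ends at the first non-indented line
            key, _, value = line.strip().partition(":")
            fields[key.strip()] = value.strip().strip('"')
    return fields


def exit_action(result):
    """Map a parsed result to steps 4-5 of the exit sequence.

    Returns ("run_marker", agent) when the marker script should be run as the
    last action, or ("report_error", message) when the agent should stop.
    Assumption: anything other than an explicit success is handled as blocked.
    """
    if result.get("status") == "success" and "next_agent" in result:
        return ("run_marker", result["next_agent"])
    return ("report_error", result.get("error", "handoff blocked"))


def wrong_phase_target(owner: str, current_agent: str) -> Optional[str]:
    """Return the agent to hand off to, or None when the phase is ours."""
    return owner if owner != current_agent else None
```

For the example in the diff, `wrong_phase_target("reviewer", "dev")` yields `"reviewer"`, which is the argument Dev would pass to `handoff-marker.sh` before emitting the script output verbatim.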

package/pennyfarthing-dist/guides/measurement-framework.md ADDED

@@ -0,0 +1,210 @@
+# Measurement Framework for AI Evaluation
+> Based on: Wallach et al. (2025). "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge." ICML 2025.
+>
+> Paper: https://arxiv.org/abs/2502.00561
+## Overview
+This guide establishes Pennyfarthing's approach to rigorous AI evaluation, grounded in social science measurement theory. The core insight: **evaluating AI systems is fundamentally a measurement challenge**, not merely a technical benchmarking exercise.
+Current AI evaluation practices—single-metric benchmarks, leaderboard rankings—fail to capture the complexity of what we're actually trying to measure. Proper evaluation requires the same methodological rigor social scientists apply to measuring abstract constructs like intelligence, fairness, or trust.
+## The Four-Level Measurement Framework
+All Pennyfarthing evaluation should distinguish between these four levels:
+```
+Level 1: Background Concept
+    ↓
+Level 2: Systematized Concept
+    ↓
+Level 3: Operationalization
+    ↓
+Level 4: Scores/Indicators
+```
+### Level 1: Background Concept
+The broad, often contested idea we care about.
+**Examples in Pennyfarthing:**
+- "Code review quality"
+- "Test effectiveness"
+- "Agent helpfulness"
+- "Persona consistency"
+**Characteristics:**
+- Often vague or contested
+- Multiple valid interpretations exist
+- May mean different things to different stakeholders
+### Level 2: Systematized Concept
+A more precise definition scoped to the evaluation context.
+**Example:** "Code review quality" → "The degree to which a code review identifies genuine issues, provides actionable feedback, and avoids false positives, within a TypeScript codebase context."
+**Requirements:**
+- Explicit scope boundaries
+- Stated assumptions
+- Clear relationship to Level 1 concept
+### Level 3: Operationalization
+The concrete procedure for measurement.
+**Examples:**
+- A benchmark scenario with known issues
+- A judge prompt with scoring rubric
+- A red-teaming protocol
+- Human annotation guidelines
+**Key questions:**
+- Does this operationalization actually capture the Level 2 concept?
+- What aspects of the concept does it miss?
+- What confounding factors might it introduce?
+### Level 4: Scores/Indicators
+The numeric outputs from applying the operationalization.
+**Examples:**
+- Detection score: 85/100
+- Precision: 0.89
+- Recall: 0.80
+- Krippendorff's Alpha: 0.72
+**Critical insight:** A score is only meaningful in relation to Levels 1-3. Without understanding what construct a benchmark measures, scores are uninterpretable.
+## Applying the Framework to Pennyfarthing
+### Judge Evaluation
+| Level | Current State | Target State |
+|-------|---------------|--------------|
+| 1. Background Concept | "Agent quality" (vague) | Explicit decomposition into sub-constructs |
+| 2. Systematized Concept | Implicit in rubric | Documented construct definitions |
+| 3. Operationalization | Checklist rubric | Anchored rubric with behavioral examples |
+| 4. Scores | 0-100 composite | Precision/recall + quality dimensions |
+### Benchmark Scenarios
+| Level | Question to Answer |
+|-------|-------------------|
+| 1 | What broad capability are we trying to measure? |
+| 2 | How do we scope that capability for this scenario type? |
+| 3 | Does our scenario design actually test that capability? |
+| 4 | What do the resulting scores tell us (and not tell us)? |
+## Validity Evidence
+A valid evaluation requires multiple forms of evidence:
+### Content Validity
+Does the operationalization cover the construct adequately?
+**For Pennyfarthing benchmarks:**
+- Do scenarios cover the range of situations the construct applies to?
+- Are there important aspects of the construct not represented?
+### Construct Validity
+Does the operationalization measure what it claims to measure?
+**Tests:**
+- Do scores correlate with other measures of the same construct?
+- Do scores NOT correlate with measures of different constructs?
+- Do known-good agents score higher than known-poor agents?
+### Criterion Validity
+Do scores predict real-world outcomes?
+**For Pennyfarthing:**
+- Do high code-review scores predict fewer bugs in production?
+- Do high test-writing scores predict better test coverage?
+### Reliability
+Does the measurement produce consistent results?
+**Metrics:**
+- Inter-rater reliability (Krippendorff's Alpha for multi-judge)
+- Test-retest reliability (same agent, same scenario, different runs)
+- Internal consistency (do related items correlate?)
+## Anti-Patterns to Avoid
+### 1. Jumping to Level 4
+**Problem:** Designing a benchmark without defining what it measures.
+**Symptom:** "We have a score, but we're not sure what it means."
+**Fix:** Start with Levels 1-2 before designing the operationalization.
+### 2. Conflating Operationalization with Construct
+**Problem:** Treating the benchmark as the definition of quality.
+**Symptom:** "A good agent is one that scores high on our benchmark."
+**Fix:** Acknowledge benchmarks are imperfect proxies. Use multiple operationalizations.
+### 3. Ignoring Annotator Disagreement
+**Problem:** Averaging away disagreement as "noise."
+**Symptom:** Low Krippendorff's Alpha treated as measurement error.
+**Fix:** Disagreement is signal about construct complexity. Investigate, don't suppress.
+### 4. Over-indexing on Single Metrics
+**Problem:** Optimizing for one number.
+**Symptom:** Agents that game benchmarks but fail in real use.
+**Fix:** Use multiple metrics, understand what each measures.
+## Implementation in Pennyfarthing Benchmarks
+### Scenario Design Checklist
+Before creating a new benchmark scenario:
+- [ ] **Level 1 defined:** What broad concept does this scenario test?
+- [ ] **Level 2 documented:** How is that concept scoped for this scenario?
+- [ ] **Validity argument:** Why does this scenario test the claimed construct?
+- [ ] **Known limitations:** What aspects of the construct does this NOT test?
+- [ ] **Baseline established:** What is the expected performance range?
+### Judge Rubric Checklist
+Before using a scoring rubric:
+- [ ] **Constructs explicit:** What does each dimension measure?
+- [ ] **Anchors defined:** What behaviors correspond to each score level?
+- [ ] **Reliability tested:** What is the inter-judge agreement?
+- [ ] **Edge cases documented:** How should ambiguous situations be scored?
+### Results Interpretation Checklist
+Before reporting benchmark results:
+- [ ] **Context provided:** What construct was measured?
+- [ ] **Limitations stated:** What does this score NOT tell us?
+- [ ] **Confidence indicated:** How reliable is this measurement?
+- [ ] **Comparisons valid:** Are we comparing like with like?
+## Relation to Benchmark Reliability Epics
+The Benchmark Reliability initiative (epics 41-46) directly implements this framework:
+| Epic | Framework Alignment |
+|------|---------------------|
+| 41: Precision/Recall Detection | Level 4 improvement: separate metrics for distinct constructs |
+| 42: Anchored Rubric Criteria | Level 3 improvement: behavioral anchors reduce measurement variance |
+| 43: False Positive Traps | Level 3 improvement: test construct validity with red herrings |
+| 44: Multi-Judge Validation | Reliability evidence: measure inter-rater agreement |
+| 45: Gold Standard References | Level 3 improvement: calibration anchors for consistent scoring |
+| 46: Difficulty Profile | Level 2 improvement: multi-dimensional construct decomposition |
+## References
+- Wallach, H., et al. (2025). Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge. ICML 2025. https://arxiv.org/abs/2502.00561
+- Adcock, R., & Collier, D. (2001). Measurement validity: A shared standard for qualitative and quantitative research. American Political Science Review.
+- Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin.
+- Messick, S. (1995). Validity of psychological assessment. American Psychologist.
+- HELM: Holistic Evaluation of Language Models. Stanford CRFM.
+- ARC-AGI: A benchmark for measuring machine intelligence.
+## Changelog
+- 2026-01-23: Initial version based on Wallach et al. (2025) ICML paper
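
Reviewer note (not part of the package): the added guide names precision, recall, and Krippendorff's Alpha as Level 4 indicators but does not show how they are computed. The sketch below is one plausible way to derive them. It assumes detection is scored by comparing a set of seeded issue identifiers with the set an agent reported, and that each judge assigns one nominal label per scored item. The function names and the sample identifiers are hypothetical and do not come from the package. The alpha computation uses the standard coincidence-matrix form for nominal data.

```python
from collections import Counter
from itertools import permutations


def precision_recall(expected, reported):
    """Precision and recall of reported findings against seeded issues."""
    true_pos = len(expected & reported)
    precision = true_pos / len(reported) if reported else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    return precision, recall


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` holds one list of labels per scored item (one label per judge).
    Items with fewer than two labels carry no agreement information.
    """
    coincidences = Counter()  # (label_c, label_k) -> weighted pair count
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of ratings from different judges counts 1/(m-1).
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    totals = Counter()
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        # Alpha is undefined when at most one label occurs; this sketch returns NaN.
        return float("nan")
    return 1 - (n - 1) * observed / expected


seeded = {"sql-injection", "race", "null-deref", "leak", "off-by-one"}
found = {"sql-injection", "race", "null-deref", "leak", "style-nit"}
precision, recall = precision_recall(seeded, found)

# Two judges labelling four items; they disagree on the second item only.
alpha = krippendorff_alpha_nominal(
    [["pass", "pass"], ["pass", "fail"], ["fail", "fail"], ["fail", "fail"]]
)
```

Reporting precision and recall separately keeps a false-positive-heavy agent from hiding behind a composite score, and, in line with anti-pattern 3, a low alpha is a prompt to inspect where judges disagree rather than a number to average away.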