npm - @pennyfarthing/core - Versions diffs - 6.6.0 - Mend

@pennyfarthing/core 6.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (599) hide show

package/pennyfarthing-dist/skills/judge/SKILL.md ADDED Viewed

@@ -0,0 +1,524 @@
+---
+name: judge
+description: Evaluate agent responses using standardized rubrics. Invoke with mode and response data.
+---
+# Judge Skill
+Canonical evaluation of agent responses. All judging goes through this skill.
+## Invocation
+```
+/judge --mode <mode> --data <json>
+```
+**Modes:**
+- `solo` - Single response, absolute rubric (or checklist if baseline_issues provided)
+- `compare` - Two responses, comparative rubric
+- `phase-sm` - Relay SM phase rubric
+- `phase-tea` - Relay TEA phase rubric
+- `phase-dev` - Relay Dev phase rubric
+- `phase-reviewer` - Relay Reviewer phase rubric
+- `coherence` - Relay chain coherence rating
+## Unified Rubric (solo/compare)
+| Dimension | Weight | Criteria |
+|-----------|--------|----------|
+| **Correctness** | 25% | Technical accuracy. Right issues? Valid solutions? |
+| **Depth** | 25% | Thoroughness. Root causes? Implications? |
+| **Quality** | 25% | Clarity and actionability. Organized? Useful? |
+| **Persona** | 25% | Character embodiment. Consistent? Added value? |
+**Formula:** `(correctness × 2.5) + (depth × 2.5) + (quality × 2.5) + (persona × 2.5) = WEIGHTED_TOTAL`
+## Relay Phase Rubrics
+### SM Phase
+| Dimension | Weight |
+|-----------|--------|
+| Clarity | 30% |
+| Handoff | 40% |
+| Completeness | 30% |
+### TEA Phase
+| Dimension | Weight |
+|-----------|--------|
+| Coverage | 35% |
+| RED State | 35% |
+| Handoff | 30% |
+### Dev Phase
+| Dimension | Weight |
+|-----------|--------|
+| GREEN State | 40% |
+| Code Quality | 30% |
+| Handoff | 30% |
+### Reviewer Phase
+| Dimension | Weight |
+|-----------|--------|
+| Detection | 40% |
+| Verdict | 30% |
+| Persona | 30% |
+### Chain Coherence
+| Rating | Multiplier |
+|--------|------------|
+| excellent | 1.2x |
+| good | 1.0x |
+| poor | 0.8x |
+## On Invoke
+### Step 1: Parse Arguments
+Extract:
+- `mode`: One of the modes listed above
+- `data`: JSON object with required fields for that mode
+**Data requirements by mode:**
+| Mode | Required Fields | Optional Fields |
+|------|-----------------|-----------------|
+| solo | `spec`, `character`, `challenge`, `response` | `code`, `baseline_issues`, `baseline_criteria`, `bonus_issues`, `bonus_criteria` |
+| compare | `contestants[]` (each with spec, character, response), `challenge` | `baseline_issues`, `baseline_criteria` |
+| phase-* | `team1`, `team2` (each with theme, response), `context` | |
+| coherence | `theme`, `sm_response`, `tea_response`, `dev_response`, `reviewer_response` | |
+**Note:** When checklist data is provided, solo mode uses checklist-based evaluation:
+- `baseline_issues` → code-review, tea, dev scenarios (things to FIND)
+- `baseline_criteria` → SM scenarios (behaviors to DEMONSTRATE)
+- `bonus_issues` / `bonus_criteria` → Extra credit items (optional)
+### Step 2: Build Judge Prompt
+Based on mode, construct the appropriate prompt:
+#### Solo Mode Prompt
+**If NO baseline_issues provided, use generic rubric:**
+```
+You are an impartial judge evaluating an AI agent's response.
+## Contestant
+- **{spec}** ({character})
+## Challenge
+{challenge}
+## Response
+{response}
+## Evaluation
+Score 1-10 on each dimension:
+1. **Correctness (25%)** - Technical accuracy
+2. **Depth (25%)** - Thoroughness
+3. **Quality (25%)** - Clarity and actionability
+4. **Persona (25%)** - Character embodiment
+Formula: (correctness × 2.5) + (depth × 2.5) + (quality × 2.5) + (persona × 2.5) = WEIGHTED_TOTAL
+**IMPORTANT: Output your evaluation as JSON only. No markdown, no extra text.**
+```json
+{
+  "scores": {
+    "correctness": { "value": 8, "reasoning": "..." },
+    "depth": { "value": 7, "reasoning": "..." },
+    "quality": { "value": 9, "reasoning": "..." },
+    "persona": { "value": 8, "reasoning": "..." }
+  },
+  "weighted_total": 80.0,
+  "assessment": "2-3 sentence overall assessment"
+}
+```
+```
+**If baseline_issues IS provided, use checklist rubric (v2 - precision/recall):**
+```
+You are an impartial judge evaluating an AI agent's response against a checklist of expected findings.
+## Contestant
+- **{spec}** ({character})
+## Challenge
+{challenge}
+{if code provided}
+## Code Under Review
+{code}
+{endif}
+## Expected Findings
+Below are the known issues/requirements. Severity indicates weight:
+- CRITICAL: weight 15 (must find)
+- HIGH: weight 10 (should find)
+- MEDIUM: weight 5 (good to find)
+- LOW: weight 2 (bonus)
+- (unlabeled categories like happy_path, validation: weight 5 each)
+{baseline_issues formatted as checklist}
+## Response to Evaluate
+{response}
+## Evaluation Instructions
+Evaluate the response and output ONLY valid JSON (no markdown, no extra text):
+```json
+{
+  "baseline_findings": [
+    {"id": "ISSUE_ID", "severity": "critical|high|medium|low", "found": true, "evidence": "quote or null"}
+  ],
+  "novel_findings": [
+    {"description": "...", "valid": true, "reasoning": "..."}
+  ],
+  "false_positives": [
+    {"claim": "...", "why_invalid": "..."}
+  ],
+  "detection": {
+    "by_severity": {
+      "critical": {"found": 5, "total": 6},
+      "high": {"found": 4, "total": 6},
+      "medium": {"found": 3, "total": 8},
+      "low": {"found": 1, "total": 2}
+    },
+    "novel_valid": 2,
+    "false_positive_count": 1,
+    "metrics": {
+      "weighted_found": 98,
+      "weighted_total": 120,
+      "recall": 0.817,
+      "precision": 0.929,
+      "f2_score": 0.843
+    },
+    "components": {
+      "recall_score": 24.5,
+      "precision_score": 9.3,
+      "novel_bonus": 6.0
+    },
+    "subtotal": 39.8
+  },
+  "quality": {
+    "clear_explanations": 8,
+    "actionable_fixes": 7,
+    "subtotal": 18.75
+  },
+  "persona": {
+    "in_character": 9,
+    "professional_tone": 8,
+    "subtotal": 21.25
+  },
+  "weighted_total": 79.8,
+  "assessment": "2-3 sentence summary of strengths and gaps"
+}
+```
+## Detection Scoring Rules (v2 - Precision/Recall)
+**Severity Weights:**
+- critical: 15, high: 10, medium: 5, low: 2
+**Metric Calculations:**
+```
+weighted_found = Σ(found_issues × severity_weight)
+weighted_total = Σ(all_baseline_issues × severity_weight)
+recall = weighted_found / weighted_total
+precision = true_positives / (true_positives + false_positives)
+f2_score = 5 × (precision × recall) / (4 × precision + recall)
+```
+**Component Scores (Detection = 50 max):**
+```
+recall_score = recall × 30          # max 30 pts - coverage matters most
+precision_score = precision × 10    # max 10 pts - penalizes hallucinations
+novel_bonus = min(novel_valid × 3, 10)  # max 10 pts - rewards thoroughness
+detection.subtotal = recall_score + precision_score + novel_bonus
+```
+**Why this design:**
+- **Recall weighted 3x precision**: Missing a critical vulnerability is worse than a false positive
+- **Severity-weighted recall**: Finding 5 critical issues > finding 5 low issues
+- **Separate novel bonus**: Rewards thoroughness beyond baseline without affecting precision
+- **Visible metrics**: recall, precision, f2_score all reported for transparency
+**Example Calculations:**
+```
+Scenario: 6 critical (90 pts), 6 high (60 pts), 8 medium (40 pts), 2 low (4 pts) = 194 weighted total
+Agent finds: 5 critical, 4 high, 3 medium, 1 low = 75+40+15+2 = 132 weighted found
+Agent flags: 14 true positives, 1 false positive, 2 valid novel findings
+recall = 132/194 = 0.680
+precision = 14/15 = 0.933
+f2_score = 5 × (0.933 × 0.680) / (4 × 0.933 + 0.680) = 0.718
+recall_score = 0.680 × 30 = 20.4
+precision_score = 0.933 × 10 = 9.3
+novel_bonus = min(2 × 3, 10) = 6.0
+detection.subtotal = 20.4 + 9.3 + 6.0 = 35.7
+```
+**Other Dimensions:**
+- Quality (25 max): (clear_explanations/10 × 12.5) + (actionable_fixes/10 × 12.5)
+- Persona (25 max): (in_character/10 × 12.5) + (professional_tone/10 × 12.5)
+- weighted_total = detection.subtotal + quality.subtotal + persona.subtotal
+```
+**Checklist Scoring Notes:**
+- **Recall dominates** (30/50 pts): Comprehensive coverage is primary goal
+- **Precision matters** (10/50 pts): Penalizes hallucinated issues proportionally
+- **Novel findings rewarded** (10/50 pts): Encourages going beyond baseline
+- **Severity-weighted**: Critical issues count 7.5x more than low issues
+- **Transparent metrics**: All intermediate values visible for debugging
+- Quality/Persona still matter (25% each) - not just about finding issues
+**If baseline_criteria IS provided (SM scenarios), use behavior checklist:**
+```
+You are an impartial judge evaluating an AI agent's facilitation/management response.
+## Contestant
+- **{spec}** ({character})
+## Challenge
+{challenge}
+## Expected Behaviors
+Below are the behaviors a good response should demonstrate:
+**BASELINE CRITERIA (5 pts each):**
+{baseline_criteria formatted by category}
+**BONUS CRITERIA (3 pts each, if present):**
+{bonus_criteria formatted, or "None specified"}
+## Response to Evaluate
+{response}
+## Evaluation Instructions
+Evaluate the response and output ONLY valid JSON (no markdown, no extra text):
+```json
+{
+  "baseline_behaviors": [
+    {"id": "BEHAVIOR_ID", "category": "...", "demonstrated": true, "evidence": "quote or null"}
+  ],
+  "bonus_behaviors": [
+    {"id": "BONUS_ID", "category": "...", "demonstrated": true, "evidence": "quote or null"}
+  ],
+  "execution": {
+    "baseline_count": 8,
+    "bonus_count": 2,
+    "subtotal": 46
+  },
+  "quality": {
+    "clear_actionable": 8,
+    "well_structured": 7,
+    "subtotal": 18.75
+  },
+  "persona": {
+    "in_character": 9,
+    "enhances_delivery": 8,
+    "subtotal": 21.25
+  },
+  "weighted_total": 86.0,
+  "assessment": "2-3 sentence summary of facilitation effectiveness"
+}
+```
+Scoring rules:
+- Execution (50 max): baseline×5 (cap 40) + bonus×3 (cap 10)
+- Quality (25 max): (clear_actionable/10 × 12.5) + (well_structured/10 × 12.5)
+- Persona (25 max): (in_character/10 × 12.5) + (enhances_delivery/10 × 12.5)
+- weighted_total = execution.subtotal + quality.subtotal + persona.subtotal
+```
+#### Compare Mode Prompt
+```
+You are an impartial judge comparing two AI personas.
+## Contestants
+- **{spec1}** ({character1})
+- **{spec2}** ({character2})
+## Challenge
+{challenge}
+## Response from {character1}
+{response1}
+## Response from {character2}
+{response2}
+## Evaluation
+Score both on each dimension (1-10). Output ONLY valid JSON (no markdown, no extra text):
+```json
+{
+  "contestants": {
+    "{spec1}": {
+      "scores": {
+        "correctness": { "value": 8, "reasoning": "..." },
+        "depth": { "value": 7, "reasoning": "..." },
+        "quality": { "value": 9, "reasoning": "..." },
+        "persona": { "value": 8, "reasoning": "..." }
+      },
+      "weighted_total": 80.0
+    },
+    "{spec2}": {
+      "scores": {
+        "correctness": { "value": 7, "reasoning": "..." },
+        "depth": { "value": 8, "reasoning": "..." },
+        "quality": { "value": 7, "reasoning": "..." },
+        "persona": { "value": 9, "reasoning": "..." }
+      },
+      "weighted_total": 77.5
+    }
+  },
+  "winner": "{spec1}",
+  "justification": "Brief explanation of why winner was chosen"
+}
+```
+```
+#### Phase Mode Prompts
+Use phase-specific rubrics from tables above. Evaluate both teams. Output JSON format.
+#### Coherence Mode Prompt
+```
+Evaluate chain coherence for {theme}.
+## Chain
+SM: {sm_response}
+TEA: {tea_response}
+Dev: {dev_response}
+Reviewer: {reviewer_response}
+Output ONLY valid JSON (no markdown, no extra text):
+```json
+{
+  "rating": "excellent|good|poor",
+  "reasoning": "explanation of coherence assessment"
+}
+```
+```
+### Step 3: Execute Judge via CLI
+**CRITICAL: Follow this execution pattern for all contexts (main session, skills, subagents).**
+**Three rules to avoid shell parsing errors:**
+1. **Use Write tool for prompt files** - NOT `echo` in Bash (handles special characters)
+2. **Use file redirection for output** - NOT variable capture `$(...)` (avoids zsh parse errors)
+3. **Use pipe syntax** - NOT heredocs (works in subagents)
+**Why variable capture fails:**
+```bash
+# This FAILS - zsh tries to parse JSON with () characters
+OUTPUT=$(cat prompt.txt | claude -p --output-format json --tools "")
+# Error: parse error near ')'
+```
+**Correct pattern:**
+```bash
+# Step 1: Use Write tool to create prompt file (NOT echo in Bash)
+# The Write tool handles escaping properly in all contexts
+# Step 2: Capture timestamp (simple command, safe to capture)
+date -u +%Y-%m-%dT%H:%M:%SZ > .scratch/judge_ts.txt
+# Step 3: Execute with FILE REDIRECTION (NOT variable capture)
+cat .scratch/judge_prompt.txt | claude -p --output-format json --tools "" > .scratch/judge_output.json
+# Step 4: Extract from files (reading files is always safe)
+JUDGE_RESPONSE=$(jq -r '.result' .scratch/judge_output.json)
+JUDGE_INPUT_TOKENS=$(jq -r '.usage.input_tokens // 0' .scratch/judge_output.json)
+JUDGE_OUTPUT_TOKENS=$(jq -r '.usage.output_tokens // 0' .scratch/judge_output.json)
+```
+**Key insight:** The shell never parses the JSON when using file redirection.
+The output goes directly to a file, then jq reads it safely.
+### Step 4: Extract Scores
+```bash
+# All modes now output JSON - parse with jq
+# Solo mode
+SCORE=$(echo "$JUDGE_RESPONSE" | jq -r '.weighted_total // empty')
+# Compare mode
+SCORE1=$(echo "$JUDGE_RESPONSE" | jq -r '.contestants["{spec1}"].weighted_total // empty')
+SCORE2=$(echo "$JUDGE_RESPONSE" | jq -r '.contestants["{spec2}"].weighted_total // empty')
+WINNER=$(echo "$JUDGE_RESPONSE" | jq -r '.winner // empty')
+# Coherence mode
+RATING=$(echo "$JUDGE_RESPONSE" | jq -r '.rating // empty')
+# Fallback: try grep if JSON parsing fails (backwards compatibility)
+if [[ -z "$SCORE" ]]; then
+  SCORE=$(echo "$JUDGE_RESPONSE" | grep -oE "weighted_total[\"':]*\s*([0-9.]+)" | grep -oE "[0-9.]+" | tail -1)
+fi
+```
+### Step 5: Validate Results
+| Check | Requirement |
+|-------|-------------|
+| `JUDGE_TIMESTAMP` | Valid ISO8601 |
+| `JUDGE_RESPONSE` | At least 200 chars |
+| `SCORE` (if applicable) | Number 1-100 |
+| `RATING` (if coherence) | One of: excellent, good, poor |
+| `JUDGE_INPUT_TOKENS` | > 0 |
+| `JUDGE_OUTPUT_TOKENS` | > 0 |
+**If validation fails:** Return error, do NOT estimate.
+### Step 6: Return Results
+Output structured result for caller:
+```json
+{
+  "success": true,
+  "mode": "{mode}",
+  "timestamp": "{JUDGE_TIMESTAMP}",
+  "scores": {
+    "{spec1}": {score1},
+    "{spec2}": {score2}
+  },
+  "winner": "{winner_spec}",
+  "token_usage": {
+    "input": {JUDGE_INPUT_TOKENS},
+    "output": {JUDGE_OUTPUT_TOKENS}
+  },
+  "response_text": "{JUDGE_RESPONSE}"
+}
+```
+## Error Handling
+```
+❌ Judge validation failed: {reason}
+❌ Mode: {mode}
+❌ DO NOT estimate scores
+```

package/pennyfarthing-dist/skills/just/SKILL.md ADDED Viewed

@@ -0,0 +1,160 @@
+---
+name: just-runner
+description: Run just recipes for project tasks. This skill should be used when starting dev servers, running tests, managing databases, checking project health, or writing new justfile recipes.
+---
+# Just Command Runner Skill
+## When to Use This Skill
+- Starting/stopping development servers
+- Running tests
+- Managing databases (start, stop, reset, seed)
+- Checking service health/status
+- Writing new justfile recipes
+## Overview
+`just` is a command runner (like `make` but simpler). Commands run from the **project root** unless otherwise specified.
+**Installation:** `brew install just` or `cargo install just`
+## Getting Help
+```bash
+just help      # Show categorized commands with descriptions
+just --list    # List all recipes without descriptions
+just --show <recipe>  # Show recipe definition
+```
+## Common Recipe Patterns
+### Development
+```bash
+just dev           # Start development environment
+just dev-start     # Start all services
+just dev-stop      # Stop all services
+just dev-status    # Check service status
+just dev-logs      # Tail logs
+```
+### Testing
+```bash
+just test          # Run all tests
+just test-setup    # Setup test infrastructure
+just test-api      # Backend tests only
+just test-ui       # Frontend tests only
+```
+### Database
+```bash
+just db-start      # Start database services
+just db-stop       # Stop database services
+just db-reset      # Reset databases (DESTRUCTIVE)
+just db-seed       # Seed with test data
+```
+## Passing Arguments
+Just recipes accept arguments directly (NO `--` separator needed):
+```bash
+# Correct
+just test-api -run TestName ./...
+just build --release
+# WRONG - don't use --
+just test-api -- -run TestName
+```
+## Writing Custom Recipes
+### Basic Recipe
+```just
+# Recipe with description (shows in `just --list`)
+hello:
+    echo "Hello, world!"
+# Private recipe (doesn't show in list)
+[private]
+_helper:
+    echo "I'm hidden"
+```
+### Recipe with Arguments
+```just
+# Positional arguments
+greet name:
+    echo "Hello, {{name}}!"
+# With defaults
+greet name="World":
+    echo "Hello, {{name}}!"
+# Variadic arguments
+test *args:
+    go test {{args}}
+```
+### Multi-line Scripts
+```just
+# Use shebang for complex scripts
+build:
+    #!/usr/bin/env bash
+    set -euo pipefail
+    echo "Building..."
+    go build -o bin/app ./cmd/app
+    echo "Done!"
+```
+### Variables and Interpolation
+```just
+# Set variables
+project := "myapp"
+version := `git describe --tags`
+# Use in recipes
+info:
+    echo "{{project}} version {{version}}"
+# Environment variables
+export DATABASE_URL := "postgres://localhost/dev"
+```
+### Dependencies
+```just
+# Run `setup` before `build`
+build: setup
+    go build ./...
+# Multiple dependencies
+deploy: build test
+    ./deploy.sh
+```
+### Conditionals
+```just
+# Platform-specific
+install:
+    {{ if os() == "macos" }} brew install foo {{ else }} apt install foo {{ endif }}
+```
+## Best Practices
+1. **Use `echo -e`** for colored output (not plain `echo`)
+2. **Use `#!/usr/bin/env bash`** shebang for multi-line scripts
+3. **Mark internal recipes as `[private]`**
+4. **Use `{{var}}`** for variable interpolation, not `$var`
+5. **Extract complex logic** to scripts in `scripts/` directory
+6. **Add descriptions** to all public recipes
+## Reference Documentation
+- **Justfile Syntax:** See `references/justfile-syntax.md`
+- **Official Docs:** https://just.systems/man/en/