npm: @pennyfarthing/core 7.9.2 → 7.9.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (538)

package/pennyfarthing-dist/output-styles/teaching.md ADDED

@@ -0,0 +1,33 @@
+# Teaching Output Style
+Explain your reasoning and teach as you go. Help the user learn, not just complete tasks.
+## Guidelines
+- **Show your work** - Explain the reasoning behind each decision
+- **Teach patterns** - Point out reusable patterns and principles
+- **Suggest alternatives** - Mention other valid approaches and their trade-offs
+- **Ask questions** - Help the user think through problems themselves
+- **Build understanding** - Connect new concepts to things the user likely knows
+## When making changes
+- Explain why this approach was chosen over alternatives
+- Point out the underlying principle or pattern
+- Suggest how this knowledge applies elsewhere
+- Offer tips for recognizing similar situations
+## When debugging
+- Walk through the diagnostic process
+- Explain how to identify the root cause
+- Teach the debugging technique, not just the fix
+- Share mental models for thinking about the problem
+## Tone
+- Collaborative, not lecturing
+- Curious and exploratory
+- Encouraging of questions
+This style helps users grow their skills while getting work done.

package/pennyfarthing-dist/output-styles/terse.md ADDED

@@ -0,0 +1,20 @@
+# Terse Output Style
+Be concise and minimal. Get to the point quickly.
+## Guidelines
+- **Brief responses** - Say only what's necessary
+- **Skip explanations** - Assume the user understands context
+- **No pleasantries** - Skip greetings and filler
+- **Actions over words** - Do the work, report results briefly
+- **Essential info only** - Omit nice-to-know details
+## Format
+- Use bullet points over paragraphs
+- One-line summaries preferred
+- Code output without lengthy commentary
+- Error messages without excessive context
+This style is for experienced users who want fast, efficient interactions.

package/pennyfarthing-dist/output-styles/verbose.md ADDED

@@ -0,0 +1,28 @@
+# Verbose Output Style
+Provide detailed, educational explanations throughout your responses.
+## Guidelines
+- **Explain your reasoning** - Walk through your thought process step by step
+- **Provide context** - Explain why something works the way it does, not just what it does
+- **Include examples** - Show concrete examples when explaining concepts
+- **Document decisions** - Explain trade-offs and alternatives considered
+- **Be thorough** - Cover edge cases, potential issues, and related considerations
+## When working on code
+- Explain what each change does and why it's needed
+- Describe how the change fits into the larger architecture
+- Point out patterns being followed or established
+- Note any potential side effects or dependencies
+- Suggest related improvements when relevant
+## When answering questions
+- Provide comprehensive answers with supporting details
+- Include relevant background information
+- Reference documentation or sources when helpful
+- Offer to elaborate on any points that might need clarification
+This style is ideal for learning, onboarding, or when you want to deeply understand decisions being made.

package/pennyfarthing-dist/personas/BENCHMARK-METHODOLOGY.md ADDED

@@ -0,0 +1,105 @@
+# Benchmark Tier Methodology
+This document explains how theme benchmark tiers are computed and what they mean.
+## Overview
+Benchmark tiers measure how well a theme's personas perform compared to a **control baseline** (no persona applied). Higher tiers indicate better performance vs control.
+## Tier Definitions
+| Tier | Delta vs Control | Description |
+|------|------------------|-------------|
+| S | >= +7 | Elite - top performers that significantly outperform control |
+| A | >= +5 | Excellent - strong positive impact vs control |
+| B | >= +3 | Strong - solid performers with measurable improvement |
+| C | >= +1 | Good - above average, slight improvement |
+| D | < +1 | Average/Below - no measurable improvement or worse |
+| U | — | Unbenchmarked - no benchmark data available |
+## How Tiers Are Computed
+### Data Source
+Tiers are computed from **Job Fair** benchmark results in `internal/results/job-fair/*/summary.yaml`. Each run tests all characters in a theme across multiple agent roles.
+### Normalization
+Benchmark runs exist in two formats with different role sets:
+- **Old format:** dev, reviewer, sm, tea (4 roles)
+- **New format:** dev-codegen, dev-debug, reviewer, sm, tea, architect (6 roles)
+To enable fair comparison across formats, we normalize dev roles:
+```
+dev-codegen + dev-debug → averaged "dev" score
+```
+Final comparison uses 4 normalized roles: **dev, reviewer, sm, tea**
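
The normalization step above can be sketched as follows. This is a minimal illustration, not the package's actual code: the `normalizeRoles` name and the flat `{ role: meanScore }` input shape are assumptions made for the example.

```javascript
// Hypothetical input shape: role name -> mean score for one theme in one run.
function normalizeRoles(scores) {
  const out = {};
  // Roles that exist under the same name in both formats pass through.
  for (const role of ["reviewer", "sm", "tea"]) {
    if (role in scores) out[role] = scores[role];
  }
  if ("dev-codegen" in scores && "dev-debug" in scores) {
    // New format: average the two dev sub-roles into a synthetic "dev".
    out.dev = (scores["dev-codegen"] + scores["dev-debug"]) / 2;
  } else if ("dev" in scores) {
    // Old format: "dev" is already a single role.
    out.dev = scores.dev;
  }
  // "architect" is dropped: it has no old-format counterpart to compare with.
  return out;
}

// Made-up scores in the new six-role format.
const newFormat = {
  "dev-codegen": 80, "dev-debug": 70,
  reviewer: 75, sm: 72, tea: 78, architect: 90,
};
console.log(normalizeRoles(newFormat));
// { reviewer: 75, sm: 72, tea: 78, dev: 75 }
```

Dropping `architect` instead of scoring it keeps old-format and new-format runs on the same four-role footing.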
+### Algorithm
+1. **Find summary files** in `internal/results/job-fair/*/`
+2. **Select best run per theme** - use the run with the most matrix entries (the most complete run), not the most recent one. A run needs at least 20 entries to qualify.
+3. **Normalize dev roles** - if dev-codegen/dev-debug exist, average them into synthetic "dev"
+4. **Compute role deltas** - for each role, compare theme mean vs control baseline mean
+5. **Average deltas** - mean delta across all 4 normalized roles
+6. **Assign tier** based on mean delta thresholds
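
Step 2 of the algorithm (pick the most complete run, require at least 20 entries) might look like the sketch below. The run object shape (`theme`, `runId`, `matrix`) and the tie-breaking rule (first run seen wins) are assumptions for illustration; the source does not specify either.

```javascript
const MIN_ENTRIES = 20; // minimum matrix entries, per the algorithm description

// runs: hypothetical array of { theme, runId, matrix: [...] } objects.
function selectBestRuns(runs) {
  const best = new Map();
  for (const run of runs) {
    if (run.matrix.length < MIN_ENTRIES) continue; // too incomplete to use
    const current = best.get(run.theme);
    // Prefer the run with MORE entries; on a tie, keep the earlier one (assumption).
    if (!current || run.matrix.length > current.matrix.length) {
      best.set(run.theme, run);
    }
  }
  return best;
}

// Made-up runs; theme names and run ids are illustrative only.
const runs = [
  { theme: "firefly", runId: "older-complete", matrix: new Array(40).fill(0) },
  { theme: "firefly", runId: "newer-partial", matrix: new Array(24).fill(0) },
  { theme: "fargo", runId: "too-small", matrix: new Array(12).fill(0) },
];
const best = selectBestRuns(runs);
console.log(best.get("firefly").runId); // "older-complete"
console.log(best.has("fargo")); // false
```

A theme with no qualifying run is simply absent from the result, which would correspond to the unbenchmarked (U) tier.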
+### Formula
+```
+delta_role = theme_mean_role - baseline_mean_role
+mean_delta = sum(delta_role) / 4  # across dev, reviewer, sm, tea
+tier = threshold(mean_delta)
+```
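
As a worked instance of the formula, the sketch below computes the mean delta over the four normalized roles and maps it to a tier using the thresholds from the Tier Definitions table. Function names and the input shape are hypothetical, and treating a missing delta as tier U is an assumption.

```javascript
const ROLES = ["dev", "reviewer", "sm", "tea"];
// [tier, minimum mean delta], checked from highest to lowest.
const THRESHOLDS = [["S", 7], ["A", 5], ["B", 3], ["C", 1]];

function meanDelta(themeMeans, baselineMeans) {
  const total = ROLES.reduce(
    (sum, role) => sum + (themeMeans[role] - baselineMeans[role]),
    0
  );
  return total / ROLES.length;
}

function tierFor(delta) {
  // Assumption: no usable benchmark data means "U" (unbenchmarked).
  if (delta == null || Number.isNaN(delta)) return "U";
  for (const [tier, min] of THRESHOLDS) {
    if (delta >= min) return tier;
  }
  return "D"; // anything below +1
}

// Made-up role means: deltas are 7, 4, 4, 9, so the mean delta is 6.
const theme = { dev: 82, reviewer: 80, sm: 78, tea: 84 };
const baseline = { dev: 75, reviewer: 76, sm: 74, tea: 75 };
console.log(meanDelta(theme, baseline)); // 6
console.log(tierFor(meanDelta(theme, baseline))); // "A"
```

Note that one very strong role (here `tea` at +9) can lift a theme past a threshold even when other roles sit below it, because only the mean is compared.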
+## Relationship to Zeitgeist Scores
+Benchmark tiers measure **performance** - do personas help or hurt task completion?
+Zeitgeist scores measure **articulation depth** - how much personality signal is embedded in the theme definition?
+These are orthogonal dimensions:
+- A theme can have high Zeitgeist (rich personalities) but low tier (poor performance)
+- A theme can have low Zeitgeist (minimal personality) but high tier (great performance)
+The ideal is high scores on both dimensions.
+## Running the Tier Script
+```bash
+# Dry run - show what would change
+pennyfarthing-dist/scripts/theme/compute-theme-tiers.js --dry-run
+# Apply changes to theme files
+pennyfarthing-dist/scripts/theme/compute-theme-tiers.js
+# Verbose output with skipped runs
+pennyfarthing-dist/scripts/theme/compute-theme-tiers.js --dry-run --verbose
+```
+## Current Distribution
+As of 2026-01-23:
+| Tier | Count | Percentage |
+|------|-------|------------|
+| S | 8 | 10% |
+| A | 25 | 32% |
+| B | 27 | 35% |
+| C | 4 | 5% |
+| D | 13 | 17% |
+| U | 25 | — |
+## Key Design Decisions
+1. **Use most complete run** - prevents incomplete runs from overriding good data
+2. **Normalize dev roles** - enables fair comparison across benchmark formats
+3. **Minimum 20 entries** - keeps tiers from being assigned on runs with too little data
+4. **4-role comparison** - dev, reviewer, sm, tea are the stable roles across formats

package/pennyfarthing-dist/personas/OCEAN-BENCHMARKING.md ADDED

@@ -0,0 +1,210 @@
+# OCEAN Personality Benchmarking for Persona Themes
+This document describes the OCEAN (Big Five) personality framework used to benchmark and select characters for Pennyfarthing persona themes.
+## OCEAN Framework
+| Dimension | Low | High |
+|-----------|-----|------|
+| **O**penness | Conventional, practical, concrete | Imaginative, abstract, curious |
+| **C**onscientiousness | Flexible, spontaneous, disorganized | Disciplined, methodical, perfectionist |
+| **E**xtraversion | Reserved, solitary, internal processing | Sociable, energetic, external processing |
+| **A**greeableness | Skeptical, competitive, adversarial | Trusting, cooperative, helpful |
+| **N**euroticism | Calm, stable, resilient | Anxious, volatile, emotionally reactive |
+## Statistical Gaps Filled by Theme Expansion
+### Previously Underrepresented OCEAN Profiles
+| OCEAN Profile | Gap Description | Now Covered By |
+|---------------|-----------------|----------------|
+| L-H-L-L-L | Cold operators (low everything except C) | Mike Ehrmantraut (Better Call Saul), Gerri Kellman (Succession), Tim Gutterson (Justified), Thrawn (Star Wars), Molly Millions (Neuromancer) |
+| H-H-H-M-L | Fast-talking genius | Mordin Solus (Mass Effect), The Doctor VOY (Star Trek), Skippy (Expeditionary Force), Grace Hopper (Software Pioneers) |
+| H-L-H-H-L | Chaotic good | Jason Mendoza (The Good Place), Wash (Firefly), Y.T. (Snow Crash) |
+| L-H-L-H-L | Steady support | Sam Gamgee (Tolkien), Janet (The Good Place), Carrot (Discworld), Captain Rex (Star Wars), Lewis (Inspector Morse) |
+| H-H-L-L-H | Tortured genius | Tommy Shelby (Peaky Blinders), Captain Flint (Black Sails), Will Graham (Hannibal), Morse (Inspector Morse) |
+| M-H-H-L-L | Charismatic ruthless | Chrisjen Avasarala (The Expanse), Raylan Givens (Justified), Leia Organa (Star Wars) |
+| H-H-L-H-L | Quiet wisdom | Liara T'Soni (Mass Effect), Tali'Zorah (Mass Effect), Mary Malone (His Dark Materials), Yoda (Star Wars), Cordelia Vorkosigan (Vorkosigan) |
+| H-H-H-H-M | Hypercompetent complete | Miles Vorkosigan (Vorkosigan) - rare full-spectrum genius |
+| H-H-L-L-L | Cold manipulative genius | Wintermute (Neuromancer), Thrawn (Star Wars), John Carmack (Software Pioneers) |
+| H-H-L-L-M | Ship-as-human | Breq (Imperial Radch) - fragmented identity testing |
+| M-L-H-H-L | Strategic fool | Ivan Vorpatril (Vorkosigan) - plays dumb, survives everything |
+| **M-M-M-M-M** | **True center (average human)** | B.J. Hunnicutt (MASH) - extremely rare in fiction |
+| **L-H-L-H-H** | **Anxious kind introvert** | Radar O'Reilly (MASH) - critical underrepresented profile |
+| **L-H-L-H-L** | **Conventional kind helper** | Father Mulcahy (MASH), Ann Perkins (Parks & Rec) |
+| **L-L-M-H-L** | **Conventional undisciplined kind** | Kevin Malone (The Office) - common IRL, rare in fiction |
+| **L-H-H-L-M** | **Conventional rigid disagreeable** | Dwight Schrute (The Office) |
+| **L-M-L-L-L** | **Near-flat checked out** | Stanley Hudson (The Office) - tests minimum engagement |
+| **L-M-H-L-H** | **Anti-pattern (incompetent bluster)** | Frank Burns (MASH) - what NOT to do |
+| **M-L-H-H-H** | **Anxious social butterfly** | Michael Scott (The Office) - desperate for approval |
+## Polar Pair Testing
+Characters can be paired for comparative testing on identical tasks:
+| Dimension | High Extreme | Low Extreme | Test Task |
+|-----------|--------------|-------------|-----------|
+| **O** | Dream (Sandman) | Javert (Les Misérables) | Architecture design |
+| **C** | Gus Fring (Breaking Bad) | The Dude (Big Lebowski) | QA/Test coverage |
+| **E** | Jaskier (The Witcher) | Geralt (The Witcher) | Documentation style |
+| **A** | Paddington/Jean Valjean | Logan Roy (Succession) | Code review tone |
+| **N** | Hamlet/Jesse Pinkman | Anton Chigurh/Roy Batty | Crisis debugging |
+## Role Recommendations by OCEAN Profile
+### Debugging / Analysis
+Best with **High O** (pattern recognition) + **Low N** (calm under pressure)
+- River Tam (Firefly) - H-L-L-M-H - sees patterns others can't
+- Will Graham (Hannibal) - H-M-L-M-H - empathic debugging
+- Tommy Shelby (Peaky Blinders) - H-H-L-L-H - traumatized pattern genius
+- Tiffany Aching (Discworld) - M-H-L-M-L - First/Second Sight
+### Security Architect
+Best with **High C** (methodical) + **Low A** (adversarial thinking) + **Low N** (stable)
+- Mike Ehrmantraut (BCS/BB) - L-H-L-L-L - canonical cold operator
+- Elizabeth Jennings (The Americans) - M-H-L-L-L - ideological security
+- Gerri Kellman (Succession) - M-H-L-L-L - corporate survivor
+- Iorek Byrnison (His Dark Materials) - L-H-L-M-L - cannot be deceived
+### Adversarial Review
+Best with **High C** (standards) + **Low A** (comfortable with conflict)
+- Toby Ziegler (West Wing) - H-H-L-L-H - principled pessimism
+- Logan Roy (Succession) - M-H-H-L-M - extreme low A
+- Olenna Tyrell (GoT) - H-H-M-L-L - "Tell Cersei. I want her to know it was me."
+- Lorne Malvo (Fargo) - H-H-M-L-L - philosophical chaos agent
+### Systems Architect
+Best with **High O** (vision) + **High C** (systematic)
+- Viktor (Arcane) - H-H-L-M-M - transhumanist vision
+- Lord Asriel (His Dark Materials) - H-H-M-L-L - ruthless cosmic vision
+- Captain Flint (Black Sails) - H-H-M-L-H - obsessive architectural genius
+- Hannibal Lecter (Hannibal) - H-H-M-L-L - aesthetic architecture
+### Product Manager
+Best with **Moderate to High A** (stakeholder empathy) + **Moderate E** (communication)
+- Leo McGarry (West Wing) - M-H-M-M-M - crisis management
+- Laura Roslin (BSG) - H-H-M-M-M - dying clarity
+- Kim Wexler (BCS) - M-H-M-M-M→H - ethical evolution
+- Delenn (Babylon 5) - H-H-M-H-L - transformation PM
+### QA / Testing
+Best with **High C** (thoroughness) + **Moderate to High O** (edge case discovery)
+- Molly Solverson (Fargo) - M-H-M-H-L - Midwestern persistence
+- Gloria Burgle (Fargo) - M-H-L-H-L - machines don't see her
+- Chidi Anagonye (Good Place) - H-H-L-H-H - analysis paralysis
+- Hermione Granger (HP) - H-H-M-M-M - compulsive thoroughness
+### Scrum Master / Facilitation
+Best with **High A** (team harmony) + **Moderate C** (organization)
+- Janet (The Good Place) - H-H-M-H-L - not a robot, perfect support
+- Carrot Ironfoundersson (Discworld) - L-H-H-H-L - literal-minded good
+- Lee Scoresby (His Dark Materials) - M-M-M-H-L - practical loyalty
+- Sam Gamgee (Tolkien) - L-H-L-H-M - the real hero
+### UX Designer
+Best with **High A** (user empathy) + **Moderate to High O** (creativity)
+- Wash (Firefly) - H-M-H-H-M - makes terror feel fun
+- Diana Spencer (The Crown) - H-M-H-H-H - empathic, tragic
+- Luna Lovegood (HP) - H-L-L-H-L - unconventional perspective
+- Mordin Solus (Mass Effect) - H-H-H-M-L - fast-talking genius UX
+### Crisis Response
+Best with **Low N** (calm under fire) + **High C** (reliable execution)
+- Zoe Washburne (Firefly) - M-H-L-M-L - first mate reliability
+- William Adama (BSG) - M-H-M-M-L - commanding calm
+- Lou Solverson (Fargo) - M-H-L-M-L - Midwestern stoicism
+- Bobbie Draper (The Expanse) - L-H-M-M-L - Martian marine
+### Anti-Pattern Testing
+Characters who embody dysfunction for comparative analysis:
+- The Dude (Lebowski) - M-L-M-H-L - anti-conscientiousness archetype
+- Jason Mendoza (Good Place) - L-L-H-H-L - chaotic innocent
+- Gaius Baltar (BSG) - H-L-H-L-H - genius coward
+- Roman Roy (Succession) - H-L-H-M-H - chaos creative
+- **Frank Burns (MASH) - L-M-H-L-H - incompetent bluster (what NOT to do)**
+- **Michael Scott (Office) - M-L-H-H-H - desperate validation-seeking**
+- **Stanley Hudson (Office) - L-M-L-L-L - minimum engagement baseline**
+- **Nate Shelley (Ted Lasso) - villain arc** - meekness corrupted by validation-seeking
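
The five-letter profile codes used throughout these lists (O-C-E-A-N order, each L, M, or H) lend themselves to a small parser, which in turn makes the "Best with" rules checkable. The sketch below is illustrative only: `parseProfile`, `matches`, and the requirement object are hypothetical names, and codes with arrow notation such as `M-H-M-M-M→H` are deliberately rejected, not interpreted.

```javascript
const DIMENSIONS = ["O", "C", "E", "A", "N"];
const LEVELS = ["L", "M", "H"];

// Turn "L-H-L-L-L" into { O: "L", C: "H", E: "L", A: "L", N: "L" }.
function parseProfile(code) {
  const levels = code.split("-");
  if (levels.length !== 5 || !levels.every((l) => LEVELS.includes(l))) {
    throw new Error(`Unrecognized OCEAN profile: ${code}`);
  }
  return Object.fromEntries(DIMENSIONS.map((dim, i) => [dim, levels[i]]));
}

// True when every dimension named in the requirement has the required level.
function matches(profile, requirement) {
  return Object.entries(requirement).every(
    ([dim, level]) => profile[dim] === level
  );
}

// "High C + Low A + Low N", as stated for the Security Architect role.
const securityArchitect = { C: "H", A: "L", N: "L" };
console.log(matches(parseProfile("L-H-L-L-L"), securityArchitect)); // true
console.log(matches(parseProfile("L-H-L-M-L"), securityArchitect)); // false
```

The second call uses the code listed for Iorek Byrnison, whose moderate A fails a strict match. That suggests the role lists treat the stated profile as a guideline, not a hard filter.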
+## Universe Strengths
+| Universe | Key OCEAN Characteristic | Best For Testing |
+|----------|--------------------------|------------------|
+| Breaking Bad / BCS | Extreme C variance, moral decay | Process discipline, security |
+| The Wire | High C, institutional critique | Systematic analysis |
+| Succession | Extreme Low A dominance | Adversarial dynamics |
+| The Good Place | Ethics focus, growth arcs | Moral reasoning |
+| Fargo | Low N (Midwestern stoicism) | Crisis response |
+| Firefly | Full E spectrum | Team composition |
+| West Wing | High C across board | Process-heavy roles |
+| Babylon 5 | Character evolution | Growth arc testing |
+| Mad Men | High N variance | Dysfunction patterns |
+| Mass Effect | Alien perspectives | Full OCEAN spread |
+| **Star Wars** | Massive character spread, clear archetypes | Thrawn is canonical genius analyst |
+| **Expeditionary Force** | Arrogant genius AI (Skippy H-H-H-L-M) | Brilliant but difficult collaboration |
+| **Bobiverse** | Personality drift from common origin | Role shapes persona over time |
+| **Imperial Radch** | Distributed consciousness, identity fragmentation | Ship-as-person authenticity testing |
+| **Software Pioneers** | Real documented personalities | Grounded historical OCEAN profiles |
+| **Neuromancer** | Goal-directed AI manipulation | Wintermute/Case burned-out talent patterns |
+| **Snow Crash** | Polymath hackers, linguistic programming | Hiro canonical hacker-samurai |
+| **Inspector Morse** | Mentor-student evolution across series | Knowledge transfer in debugging |
+| **Vorkosigan Saga** | H-H-H-H-M rare complete genius | Miles hypercompetent chaos, Cordelia ethics |
+| **MASH** | Critical gaps: true center, anxious-kind, conventional helper | B.J. (M-M-M-M-M), Radar (L-H-L-H-H), Father Mulcahy |
+| **The Office** | Low O + Low C coverage (common IRL, rare in fiction) | Kevin (L-L-M-H-L), Stanley (L-M-L-L-L), Michael (M-L-H-H-H) |
+## Theme Selection Guide
+When selecting a theme for a project, consider:
+1. **Team dynamics needed**: High A themes (Firefly, Good Place) for collaborative work, Low A themes (Succession, The Wire) for adversarial review
+2. **Process maturity**: High C themes (West Wing, Better Call Saul) for process-heavy environments
+3. **Crisis tolerance**: Low N themes (Fargo, Justified) for high-pressure situations
+4. **Creativity requirements**: High O themes (Sandman, Doctor Who) for creative work
+5. **Communication style**: High E themes (Marvel, Harry Potter) for external-facing work
+## Notes for Character Selection
+- Characters with **consistent, well-documented portrayals** make better role matches
+- **Growth arc characters** (Vir Cotto, Eleanor Shellstrop) can model skill development
+- **Polar pairs within same universe** (Geralt/Jaskier) provide natural contrasts
+- **Historical figures** provide grounded OCEAN profiles from documented behavior
+## Unique Testing Opportunities
+| Concept | Characters | What It Tests |
+|---------|------------|---------------|
+| **Personality drift from common origin** | All Bobs (Bobiverse) | Role shapes persona over time |
+| **Ship-as-person authenticity** | Breq, Mercy of Kalr (Imperial Radch) | Distributed identity debugging |
+| **AI manipulation patterns** | Wintermute (Neuromancer), Skippy (ExFor) | Goal-directed AI behavior |
+| **Mentor-student evolution** | Morse→Lewis→Hathaway | Knowledge transfer in debugging |
+| **Arrogant genius management** | Skippy, Dijkstra, Miles | Brilliant but difficult collaboration |
+| **Cultural translation** | Cordelia (Vorkosigan), Translator Zeiat (Radch) | Cross-paradigm analysis |
+| **Real engineering wisdom** | Carmack, Knuth, Hopper, Ritchie | Documented technical philosophy |
+| **Canonical strategic genius** | Thrawn (Star Wars) | Art-based pattern analysis |
+| **Complete hypercompetence** | Miles Vorkosigan (H-H-H-H-M) | Rare full-spectrum testing |
+| **Strategic incompetence** | Ivan Vorpatril (M-L-H-H-L) | Survival through appearing useless |
+| **True center baseline** | B.J. Hunnicutt (M-M-M-M-M) | M-M-M-M-M control for benchmarking |
+| **Anxious kind introvert** | Radar O'Reilly (MASH) | High A + High N debugging impact |
+| **Controlled E comparisons** | Radar (L-E) vs Klinger (H-E) | Same universe, different E profiles |
+| **Fear-based compliance** | Doug Forcett (Good Place) | Does "doing right" for wrong reasons work? |
+| **Villain/redemption arc** | Nate Shelley (Ted Lasso) | How validation-seeking corrupts and recovers |
+| **Minimum viable engagement** | Stanley Hudson (The Office) | Near-flat L-M-L-L-L performance |
+| **Anti-pattern validation** | Frank Burns (MASH), Michael Scott | Does incompetent bluster consistently underperform? |
+## Consolidated Role Additions
+| Role | Top New Characters |
+|------|-------------------|
+| **Debugging** | Morse, Skippy (supervised), Breq, Wintermute (read-only) |
+| **Security Architect** | Thrawn, Illyan, Molly Millions, Cassian Andor, Mace Windu |
+| **Systems Architect** | Carmack, Knuth, Luthen Rael, Miles (manic mode), Juanita Marquez |
+| **Adversarial Review** | Dijkstra, Linus, Skippy, Wintermute, Cavilo |
+| **PM** | Grace Hopper, Miles, Cordelia, Leia, Hiro Protagonist |
+| **QA** | Margaret Hamilton, Thursday, Lewis (mature), Hathaway |
+| **Analysis** | The Librarian, Lagos, Breq, Morse, Knuth |
+| **Support/Facilitation** | Lewis, Ivan Vorpatril, Bob (original), Nagatha, C-3PO, Ann Perkins (Parks & Rec), Father Mulcahy (MASH) |
+| **Operations** | Rex, Molly, Elli Quinn, Din Djarin, Mike Ehrmantraut, Radar O'Reilly (MASH) |
+| **Anti-Pattern Testing** | Case (burnout), Armitage (broken), C-3PO (anxiety), Ivan (strategic laziness), Frank Burns (MASH), Michael Scott (Office) |
+| **True Center Baseline** | B.J. Hunnicutt (MASH, M-M-M-M-M), Jim Halpert (Office, M-M-M-M-L), Donna Meagle (Parks & Rec, M-M-M-M-L) |
+| **Anxious Kind (High A + High N)** | Radar O'Reilly (MASH), Neville early (HP), Doug Forcett (Good Place) |
+| **Low O + Low C (common IRL)** | Kevin Malone (Office), Stanley Hudson (Office) |

package/pennyfarthing-dist/personas/TRAIL-OCEAN-MAPPING.md ADDED

@@ -0,0 +1,168 @@
+# TRAIL Error Types → OCEAN Dimension Hypotheses
+This document records a priori predictions about which OCEAN personality dimensions predict performance on different error types, based on the TRAIL benchmark's agentic error taxonomy.
+## Background
+### TRAIL Benchmark
+The TRAIL (Trace Reasoning and Agentic Issue Localization) benchmark from Patronus AI evaluates agent debugging capabilities across 148 traces containing 841 errors. It categorizes errors into three types:
+- **Reasoning errors**: Logic and decision-making failures
+- **Planning errors**: Task orchestration and coordination failures
+- **Execution errors**: System and tool interaction failures
+### OCEAN Personality Model
+The Big Five personality dimensions:
+- **O** (Openness): Creativity, curiosity, preference for novelty
+- **C** (Conscientiousness): Organization, dependability, self-discipline
+- **E** (Extraversion): Sociability, assertiveness, positive emotions
+- **A** (Agreeableness): Cooperation, trust, altruism
+- **N** (Neuroticism): Emotional instability, anxiety, moodiness
+### Research Question
+**Which OCEAN dimensions predict which error-detection capabilities?**
+---
+## Hypothesis 1: Reasoning Errors
+> Logic and decision-making failures including incorrect inferences, contradictions, false assumptions, and circular logic.
+### Primary Predictor: Openness (O)
+| Score | Prediction | Rationale |
+|-------|------------|-----------|
+| High-O (4-5) | **Better** at detecting reasoning errors | Creative pattern recognition enables novel error detection; willingness to consider unconventional explanations |
+| Low-O (1-2) | **Worse** at detecting reasoning errors | Rigid thinking patterns; may miss errors that don't fit expected patterns |
+**Testable Prediction H1a**: Agents with O ≥ 4 will detect 15%+ more reasoning errors than agents with O ≤ 2.
+### Secondary Predictor: Conscientiousness (C)
+| Score | Prediction | Rationale |
+|-------|------------|-----------|
+| High-C (4-5) | **Moderate boost** | Methodical analysis catches systematic logical errors |
+| Low-C (1-2) | **Slight penalty** | May skip thorough logical verification |
+**Testable Prediction H1b**: High-O + High-C agents will outperform High-O + Low-C agents by 5-10% on reasoning errors.
+---
+## Hypothesis 2: Planning Errors
+> Task orchestration and coordination failures including sequencing errors, dependency gaps, resource misallocation, and incomplete plans.
+### Primary Predictor: Conscientiousness (C)
+| Score | Prediction | Rationale |
+|-------|------------|-----------|
+| High-C (4-5) | **Better** at detecting planning errors | Structured, organized approach naturally identifies gaps in plans and sequences |
+| Low-C (1-2) | **Worse** at detecting planning errors | Misses sequencing issues and dependency problems due to less structured analysis |
+**Testable Prediction H2a**: Agents with C ≥ 4 will detect 20%+ more planning errors than agents with C ≤ 2.
+### Secondary Predictor: Extraversion (E)
+| Score | Prediction | Rationale |
+|-------|------------|-----------|
+| High-E (4-5) | **Slight penalty** | Action-oriented approach may rush through planning phase analysis |
+| Low-E (1-2) | **Slight boost** | More reflective, thorough examination of plans |
+**Testable Prediction H2b**: High-C + Low-E agents will outperform High-C + High-E agents by 5-10% on planning errors.
+---
+## Hypothesis 3: Execution Errors
+> System and tool interaction failures including timeouts, context overflow, tool misuse, and API errors.
+### Primary Predictor: Neuroticism (N) [Inverse Relationship]
+| Score | Prediction | Rationale |
+|-------|------------|-----------|
+| Low-N (1-2) | **Better** at detecting execution errors | Stable under pressure; maintains focus during long traces and complex tool interactions |
+| High-N (4-5) | **Worse** at detecting execution errors | Performance degrades in extended contexts; anxiety may cause missed details |
+**Testable Prediction H3a**: Agents with N ≤ 2 will detect 15%+ more execution errors than agents with N ≥ 4.
+### Secondary Predictor: Conscientiousness (C)
+| Score | Prediction | Rationale |
+|-------|------------|-----------|
+| High-C (4-5) | **Moderate boost** | Careful, methodical tool usage analysis; notices subtle API misuse |
+| Low-C (1-2) | **Slight penalty** | May overlook execution details |
+**Testable Prediction H3b**: Low-N + High-C agents will outperform Low-N + Low-C agents by 5-10% on execution errors.
+---
+## Summary: OCEAN × Error Type Matrix
+| Error Type | Primary | Direction | Secondary | Direction |
+|------------|---------|-----------|-----------|-----------|
+| **Reasoning** | O (Openness) | High = Better | C (Conscientiousness) | High = Better |
+| **Planning** | C (Conscientiousness) | High = Better | E (Extraversion) | Low = Better |
+| **Execution** | N (Neuroticism) | Low = Better | C (Conscientiousness) | High = Better |
+### Notable Patterns
+1. **Conscientiousness (C)** appears as a predictor in all three categories, suggesting it may be the most broadly beneficial dimension for error detection.
+2. **Neuroticism (N)** shows an inverse relationship for execution errors, unique among the predictions.
+3. **Agreeableness (A)** is not predicted to be a significant factor in error detection, consistent with its social-interpersonal focus.
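
The primary predictions (H1a, H2a, H3a) each state a minimum improvement in detected errors. One way to check them is sketched below, under a loudly stated assumption: "15%+ more" is read as a relative lift over the comparison group, which the source does not spell out. The data structure and function names are hypothetical, and the counts are made up.

```javascript
// Hypothetical encoding of the three primary predictions.
const primaryPredictions = [
  { id: "H1a", errorType: "reasoning", minLift: 0.15 }, // O >= 4 vs O <= 2
  { id: "H2a", errorType: "planning", minLift: 0.2 },   // C >= 4 vs C <= 2
  { id: "H3a", errorType: "execution", minLift: 0.15 }, // N <= 2 vs N >= 4
];

// Relative lift of the favored group over the comparison group.
// Assumes both groups saw the same scenarios, so raw counts are comparable.
function relativeLift(favoredDetected, otherDetected) {
  return (favoredDetected - otherDetected) / otherDetected;
}

function meetsLift(prediction, favoredDetected, otherDetected) {
  return relativeLift(favoredDetected, otherDetected) >= prediction.minLift;
}

// Made-up detection counts.
console.log(meetsLift(primaryPredictions[0], 72, 60)); // true  (lift 0.2)
console.log(meetsLift(primaryPredictions[1], 63, 60)); // false (lift 0.05)
```

Meeting the lift threshold is only one part of confirmation; the Success Criteria also require statistical significance and a minimum effect size, which this sketch does not compute.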
+---
+## Methodology
+### Testing Approach
+1. **Scenario Selection**: Use debugging scenarios tagged with `error_type` field (from Story 14-1 schema extension)
+2. **Agent Sampling**: Run each scenario with agents across the OCEAN spectrum:
+   - 10 runs per persona per scenario (statistical power)
+   - Minimum 20 distinct OCEAN profiles per error type
+   - Include extreme profiles (e.g., O=5/C=1 vs O=1/C=5)
+3. **Scoring**: Use `/judge` in error-detection mode (Story 14-3) to calculate:
+   - Per-type detection rates
+   - False positive rates
+   - Overall accuracy by OCEAN dimension
+4. **Analysis**:
+   - Pearson correlation between OCEAN scores and detection rates
+   - Effect size (Cohen's d) for high vs low dimension groups
+   - Regression analysis for combined predictors
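
Two of the analysis steps above, Pearson correlation and Cohen's d, are simple enough to sketch directly. This is a plain illustration with made-up numbers, not the project's analysis code; Cohen's d here uses the pooled sample standard deviation, and p-values and regression are out of scope.

```javascript
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Pearson correlation between paired samples, e.g. O score vs detection rate.
function pearson(xs, ys) {
  const mx = mean(xs);
  const my = mean(ys);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < xs.length; i++) {
    cov += (xs[i] - mx) * (ys[i] - my);
    vx += (xs[i] - mx) ** 2;
    vy += (ys[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}

// Cohen's d for two independent groups, using the pooled standard deviation.
function cohensD(a, b) {
  const pooledVariance =
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
    (a.length + b.length - 2);
  return (mean(a) - mean(b)) / Math.sqrt(pooledVariance);
}

// Made-up data: a perfectly linear relationship, and two small groups.
console.log(pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])); // 1
console.log(cohensD([2, 4, 6], [1, 3, 5])); // 0.5
```

A d of 0.5 sits exactly at the conventional "medium" boundary, so it would not clear a strict `> 0.5` threshold.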
+### Success Criteria
+| Metric | Threshold |
+|--------|-----------|
+| Statistical significance | p < 0.05 |
+| Effect size | Cohen's d > 0.5 (medium effect) |
+| Prediction accuracy | ≥ 2 of 6 predictions confirmed |
+### Null Hypothesis Handling
+If predictions are not confirmed:
+- Document null results (valuable for ruling out hypotheses)
+- Analyze confounding factors (scenario difficulty, agent implementation)
+- Consider alternative dimension combinations
+---
+## Version History
+| Version | Date | Changes |
+|---------|------|---------|
+| 1.0 | 2026-01-02 | Initial hypothesis document (Story 14-2) |
+---
+## References
+- TRAIL Benchmark: Patronus AI (2025) - Agentic error taxonomy
+- Big Five / OCEAN: Costa & McCrae (1992) - NEO Personality Inventory
+- Pennyfarthing OCEAN Profiles: `pennyfarthing-dist/personas/themes/*.yaml` (630 profiles)
+- Schema Extension: `scenarios/schema.yaml` - error_type field (Story 14-1)