npm - ai-or-die - Versions diffs - 0.1.22 → 0.1.23 - Mend

ai-or-die 0.1.22 → 0.1.23

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

package/.github/workflows/ci.yml +29 -4
package/CLAUDE.md +14 -2
package/docs/adrs/0008-e2e-parallelization.md +65 -0
package/docs/agent-instructions/02-testing-and-validation.md +19 -7
package/docs/agent-instructions/03-tooling-and-pipelines.md +7 -8
package/docs/agent-instructions/04-handoff-protocol.md +63 -0
package/docs/agent-instructions/05-defensive-coding.md +170 -0
package/docs/agent-instructions/06-ci-first-testing.md +268 -0
package/docs/agent-instructions/07-docs-hygiene.md +124 -0
package/docs/agent-instructions/08-multi-agent-consultation.md +168 -0
package/e2e/playwright.config.js +7 -3
package/package.json +1 -1

package/.github/workflows/ci.yml CHANGED

@@ -52,7 +52,7 @@ jobs:
             playwright-report/
           retention-days: 14
-  test-browser-functional:
+  test-browser-functional-core:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
@@ -65,13 +65,38 @@ jobs:
       - run: npm ci
       - name: Install Playwright browsers
         run: npx playwright install chromium --with-deps
-      - name: Run functional browser tests
-        run: npx playwright test --config e2e/playwright.config.js --project functional
+      - name: Run functional core tests
+        run: npx playwright test --config e2e/playwright.config.js --project functional-core
       - name: Upload Playwright report
         uses: actions/upload-artifact@v4
         if: ${{ !cancelled() }}
         with:
-          name: playwright-functional-${{ matrix.os }}
+          name: playwright-functional-core-${{ matrix.os }}
+          path: |
+            e2e/test-results/
+            playwright-report/
+          retention-days: 14
+  test-browser-functional-extended:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [ubuntu-latest, windows-latest]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-node@v4
+        with:
+          node-version: '22'
+      - run: npm ci
+      - name: Install Playwright browsers
+        run: npx playwright install chromium --with-deps
+      - name: Run functional extended tests
+        run: npx playwright test --config e2e/playwright.config.js --project functional-extended
+      - name: Upload Playwright report
+        uses: actions/upload-artifact@v4
+        if: ${{ !cancelled() }}
+        with:
+          name: playwright-functional-extended-${{ matrix.os }}
           path: |
             e2e/test-results/
             playwright-report/

package/CLAUDE.md CHANGED

@@ -19,17 +19,29 @@ Available agents: **Architect**, **Engineer**, **QA Reviewer**, **Troubleshooter
 ### Documentation-Driven Workflow
 Before starting any task, consult the relevant documentation:
-- `docs/agent-instructions/` -- Philosophy, research guidelines, testing standards, tooling conventions
+- `docs/agent-instructions/` -- Agent workflow guides:
+  - `00-philosophy.md` -- Core principles
+  - `01-research-and-web.md` -- Research guidelines
+  - `02-testing-and-validation.md` -- Testing standards
+  - `03-tooling-and-pipelines.md` -- Tooling conventions
+  - `04-handoff-protocol.md` -- How to leave the repo clean for the next agent
+  - `05-defensive-coding.md` -- Error prevention, cross-platform traps
+  - `06-ci-first-testing.md` -- CI-only testing, E2E debugging, performance budget
+  - `07-docs-hygiene.md` -- Keeping documentation in sync
+  - `08-multi-agent-consultation.md` -- When and how to consult expert subagents
 - `docs/adrs/` -- Architecture Decision Records (check before proposing new patterns)
 - `docs/specs/` -- Component specifications (read before implementing, update after changing behavior)
 - `docs/architecture/` -- System diagrams and component overviews
-- `docs/history/` -- Incident post-mortems and debugging notes
+- `docs/history/` -- Solved problems and debugging notes (check before debugging any issue)
 ### Mandatory Rules
 1. **Spec updates with code changes**: When code behavior changes, the corresponding spec in `docs/specs/` must be updated in the same commit or PR.
 2. **ADR compliance**: Never contradict an accepted ADR. To change direction, write a new ADR that supersedes the old one.
 3. **Cross-platform support**: All code must work on both Windows and Linux. Use `path.join()` for file paths, provide `.sh` and `.ps1` script variants, and test on both platforms in CI.
 4. **Test coverage**: Every feature and bug fix requires tests. No exceptions.
+5. **CI-only testing**: All testing happens on GitHub Actions runners. Never test locally. E2E tests are the only true validation. Push → draft PR → CI → iterate.
+6. **Document what you solve**: Every solved problem goes in `docs/history/`. LLMs don't carry memories — written docs are the only institutional memory.
+7. **Consult before committing**: For significant decisions, spawn expert subagents (architect, principal engineer, lead QA, PM, designer, user researcher) in parallel. See `docs/agent-instructions/08-multi-agent-consultation.md`.
 ## Common Commands

package/docs/adrs/0008-e2e-parallelization.md ADDED

@@ -0,0 +1,65 @@
+# ADR-0008: E2E Test Parallelization Strategy
+## Status
+**Accepted**
+## Date
+2026-02-07
+## Context
+The E2E test suite has grown to 16 spec files across 6 Playwright projects. The `functional` project — containing tests 02-07, 09-image-paste, and 09-background-notifications — runs approximately 30 tests sequentially with `workers: 1`. On GitHub Actions runners, this takes 7-15 minutes per platform, exceeding the 7-minute performance budget for CI feedback loops.
+Fast CI feedback is critical because all testing happens exclusively on GitHub runners (no local testing). The push → CI → fix → push cycle must be fast enough that agents can iterate efficiently.
+## Decision
+Split the functional test group into two sub-groups and enable parallel workers in CI:
+### Test Split
+- **`functional-core`**: Tests `02-terminal-io`, `03-clipboard`, `04-context-menu`, `05-tab-switching` (core terminal interaction features)
+- **`functional-extended`**: Tests `06-large-paste`, `07-vim-and-session`, `09-image-paste`, `09-background-notifications` (extended features and cross-cutting concerns)
+### Parallel Workers
+- Set `workers: process.env.CI ? 2 : 1` in `e2e/playwright.config.js`
+- CI runs 2 Playwright workers per job for parallel test execution
+- Local development retains 1 worker for debugging simplicity (though local testing is not the primary workflow)
+### CI Pipeline Changes
+- Replace single `test-browser-functional` job with two: `test-browser-functional-core` and `test-browser-functional-extended`
+- Each runs independently and in parallel with all other browser test jobs
+- Each uploads artifacts with distinct names for failure diagnosis
+### Why this works
+- Each test already creates its own server instance via `createServer()` with an ephemeral port (port 0)
+- Sessions are per-server, eliminating cross-test state contamination
+- Playwright provides browser context isolation between parallel tests
+- No shared filesystem resources detected in the test suite
+## Consequences
+### Positive
+- No CI job exceeds 7 minutes — faster feedback for the push-fix-push workflow
+- More granular job names in CI (functional-core vs functional-extended) aid debugging — agents can immediately see which category of tests failed
+- Parallel workers within jobs further reduce wall-clock time
+- Sets a pattern for future test group splits as the suite grows
+### Negative
+- More CI jobs to monitor (6 browser test job types instead of 5, plus unit tests and build-binary)
+- Artifact names become longer and more numerous
+- If test isolation assumptions prove wrong, parallel execution could introduce flakiness (mitigated by the existing ephemeral-port pattern)
+### Neutral
+- Existing test files require no code changes — only configuration and CI workflow updates
+- The `workers: 2` setting is conservative and can be increased if runners have sufficient resources
+## Notes
+- When any job approaches 6 minutes consistently, split it further
+- When the test suite exceeds 80 tests, re-evaluate the overall split strategy
+- Monitor for flaky tests that may indicate parallel execution issues
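
Editor's sketch (not part of the package diff): the split described in ADR-0008 depends on each spec file matching the intended Playwright project. The snippet below is a minimal illustration of that mapping, using the patterns quoted in `06-ci-first-testing.md` later in this diff. The `PROJECT_PATTERNS` table and the `projectsFor` helper are hypothetical names introduced here; the real `e2e/playwright.config.js` may express the same patterns differently.

```javascript
// Hypothetical lookup table: project name -> testMatch-style pattern.
// The regexes are copied from the patterns quoted in 06-ci-first-testing.md;
// the string patterns are turned into anchored regexes for this sketch only.
const PROJECT_PATTERNS = [
  ['golden-path', /01-golden-path\.spec\.js$/],
  ['functional-core', /0[2-5]-.*\.spec\.js/],
  ['functional-extended', /0[6-7]-.*\.spec\.js|09-image-paste\.spec\.js|09-background-.*\.spec\.js/],
  ['mobile-iphone', /08-mobile-portrait\.spec\.js$/],
  ['mobile-pixel', /08-mobile-portrait\.spec\.js$/],
  ['visual-regression', /09-visual-regression\.spec\.js$/],
  ['new-features', /1[0-4]-.*\.spec\.js/],
];

// Return the names of every project whose pattern matches the spec file name.
// An empty array means the file would not run in any project.
function projectsFor(specFile) {
  return PROJECT_PATTERNS
    .filter(([, pattern]) => pattern.test(specFile))
    .map(([name]) => name);
}

console.log(projectsFor('03-clipboard.spec.js'));
console.log(projectsFor('09-background-notifications.spec.js'));
console.log(projectsFor('15-example.spec.js'));
```

A check like this can flag a newly numbered spec file (15 and above, for example) that silently matches no project and therefore never runs in CI.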

package/docs/agent-instructions/02-testing-and-validation.md CHANGED

@@ -26,14 +26,26 @@ Write tests alongside implementation, not after. The workflow:
 - Use temp directories for file system tests (see `session-store.test.js` pattern)
 - Test cross-platform behavior: path construction, command resolution, shell detection
+## CI-Only Testing
+All testing happens on GitHub Actions runners. No local test runs. Ever.
+- Local environments are unreliable: missing native modules, stale state, platform differences
+- CI provides fresh, reproducible, cross-platform results every time
+- E2E tests are the only true validation — if they pass on CI, the feature works
+The workflow: write code → push to branch → open draft PR → CI runs → read results → fix → push again.
+See `docs/agent-instructions/06-ci-first-testing.md` for the complete CI workflow guide, job map, and debugging playbook.
 ## Self-Validation
 Before committing, every agent must:
-1. Run `npm test` — all tests pass
-2. Run `npm start` — server boots without errors
-3. Run `scripts/validate.sh` (Linux) or `scripts/validate.ps1` (Windows)
-4. Verify the change doesn't break existing functionality
+1. Push to branch and open a draft PR to trigger CI
+2. Verify all CI jobs pass on both ubuntu-latest and windows-latest
+3. Check `docs/history/` for known issues if any job fails
+4. Verify the change doesn't break existing functionality (CI confirms this)
 ## What to Test
@@ -50,9 +62,9 @@ Before committing, every agent must:
 - Auth middleware behavior
 ### For Client Changes
-- Manual browser testing (create session, select tool, verify output)
-- Check mobile responsiveness
-- Verify WebSocket reconnection
+- E2E tests via Playwright (verified on CI, never locally)
+- Mobile viewport tests via mobile-iphone and mobile-pixel Playwright projects
+- WebSocket reconnection covered by E2E functional tests
 ## When Tests Fail

package/docs/agent-instructions/03-tooling-and-pipelines.md CHANGED

@@ -14,14 +14,13 @@ If you perform a verification task twice, script it. All scripts live in the `sc
 ### GitHub Actions
-The CI pipeline (`.github/workflows/ci.yml`) runs on every push and PR:
-1. **Matrix**: Runs on both `ubuntu-latest` and `windows-latest`
-2. **Install**: `npm ci`
-3. **Lint**: ESLint check
-4. **Test**: `npm test` with coverage reporting
-5. **Audit**: `npm audit` for security vulnerabilities
-6. **Docs Check**: Verify docs/ structure exists
+The CI pipeline (`.github/workflows/ci.yml`) runs on every push and PR. It runs 8 job types in parallel across ubuntu-latest and windows-latest (16 total jobs):
+- **Unit tests**: `npm test` + `npm audit`
+- **Browser E2E tests**: 6 Playwright job types (golden-path, functional-core, functional-extended, mobile, visual-regression, new-features)
+- **Binary build**: SEA binary compilation + smoke tests
+See `06-ci-first-testing.md` for the full CI job map, artifact details, and debugging workflow. CI is the only authority on whether code works (see ADR-0008 for the parallelization strategy).
 ### Release Pipeline

package/docs/agent-instructions/04-handoff-protocol.md ADDED

@@ -0,0 +1,63 @@
+# Handoff Protocol
+## The Golden Rule
+Every session ends with a cleaner repo than it started. If you touched it, you documented it. If you broke it, you fixed it. If you couldn't finish, you left a trail.
+## Pre-Handoff Checklist
+Before ending any work session, verify:
+1. **All CI jobs pass.** Push to your branch and check GitHub Actions. Both `ubuntu-latest` and `windows-latest` must be green. Do not hand off a red build.
+2. **Documentation is updated.** Specs in `docs/specs/` match the current code. ADRs are written for any architectural decisions made during the session.
+3. **No orphaned work-in-progress.** No half-implemented features sitting uncommitted. Everything is either committed and pushed, or explicitly tracked in a GitHub issue.
+4. **Commit messages explain "why", not just "what".** A future agent reading the git log should understand the reasoning without opening the diff.
+5. **New patterns and conventions are documented.** If you introduced a new coding pattern, utility, or convention, write it down in the relevant spec or instruction doc.
+## Work-in-Progress Protocol
+When you cannot finish a task:
+- Create a GitHub issue with full context: what was attempted, where it stopped, what blockers exist, and what the next steps are.
+- Use `[WIP]` prefix in commit messages for incomplete work.
+- List which files are mid-change and what state they are in.
+- Reference relevant specs, ADRs, and CI run links.
+- Never leave broken tests on main. If your work breaks tests, either fix them or revert before ending.
+## Clean Commit Hygiene
+- Follow Conventional Commits: `feat:`, `fix:`, `docs:`, `test:`, `chore:`, `refactor:`.
+- One concern per commit. Do not mix a bug fix with a feature addition.
+- Reference GitHub issues in the message: `fix: resolve WebSocket race in image upload (#42)`.
+- Commit messages should be self-contained. Another agent reading the git log should understand what happened and why without reading the diff.
+## Session Context Dump
+What to leave behind for the next agent:
+- Updated specs in `docs/specs/` reflecting any behavior changes.
+- Research findings documented in the relevant ADR or spec.
+- Error patterns discovered during debugging added to `docs/history/`.
+- Decisions made and their rationale recorded in ADRs.
+- If you modified the CI pipeline, document what changed and why.
+## Log What You Solved
+When you encounter and solve a problem, document it in `docs/history/`. LLMs do not carry memories between sessions -- written docs are the only institutional memory. Every solved problem that is not documented is a problem that will be solved again.
+See `07-docs-hygiene.md` for the history entry format and full guidelines. Before debugging any issue, always check `docs/history/` first.
+## Anti-Patterns
+Do NOT do any of these:
+- Leave vague commit messages like "Made some changes" or "Updated stuff".
+- Push uncommitted or unstaged work.
+- Leave broken tests and move on.
+- Make architectural decisions without writing an ADR.
+- Solve a problem without documenting the solution.
+- Skip spec updates when behavior changes.
+- Assume the next agent will "figure it out".
+- Delete or disable tests to make CI pass.
+- Commit secrets, API keys, tokens, or `.env` files. Check `git diff --staged` for sensitive data before every commit.
+- Expand scope beyond what was asked. If you discover adjacent issues, file them as separate GitHub issues. Do not expand scope without explicit approval.
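
Editor's sketch (not part of the package diff): the commit hygiene rules in the handoff protocol could be checked mechanically. The snippet below is a minimal illustration, assuming the `[WIP]` marker is written as a leading `[WIP] ` before the Conventional Commits type; the diff does not specify how the two combine, so that placement is an assumption. `isValidCommitSubject` is a hypothetical helper, not something the package ships.

```javascript
// Types listed in the handoff protocol's "Clean Commit Hygiene" section.
const COMMIT_TYPES = ['feat', 'fix', 'docs', 'test', 'chore', 'refactor'];

// Assumed shape: optional "[WIP] " prefix, a listed type, an optional
// "(scope)" as permitted by Conventional Commits, then ": " and a description.
const SUBJECT_PATTERN = new RegExp(
  `^(\\[WIP\\] )?(${COMMIT_TYPES.join('|')})(\\([\\w-]+\\))?: \\S.*$`
);

// Return true when the first line of a commit message follows the assumed shape.
function isValidCommitSubject(message) {
  const subject = message.split('\n')[0];
  return SUBJECT_PATTERN.test(subject);
}

console.log(isValidCommitSubject('fix: resolve WebSocket race in image upload (#42)'));
console.log(isValidCommitSubject('Updated stuff'));
```

The check covers only the subject line's shape. Whether the message explains "why" still needs a human or agent reviewer.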

package/docs/agent-instructions/05-defensive-coding.md ADDED

@@ -0,0 +1,170 @@
+# Defensive Coding
+## Validate at Boundaries
+Trust nothing that crosses a system boundary. Every REST endpoint, WebSocket handler, and bridge method should validate its inputs before processing.
+Where boundaries exist in this codebase:
+- REST API handlers in `src/server.js` -- validate request params, body, headers
+- WebSocket message handlers -- validate `type` field, required fields per message type
+- Bridge methods (`startSession`, `sendInput`, `resize`) -- validate sessionId exists, dimensions are positive integers
+- Client-to-server messages -- validate session ownership, check session is active
+Pattern:
+```javascript
+// Bad
+handleMessage(wsId, message) {
+  const session = this.sessions.get(message.sessionId);
+  session.bridge.sendInput(message.data); // crashes if session doesn't exist
+}
+// Good
+handleMessage(wsId, message) {
+  if (!message.sessionId) {
+    return this.sendError(wsId, 'Missing sessionId');
+  }
+  const session = this.sessions.get(message.sessionId);
+  if (!session) {
+    return this.sendError(wsId, `Session '${message.sessionId}' not found`);
+  }
+  if (!session.active) {
+    return this.sendError(wsId, `Session '${message.sessionId}' is not active`);
+  }
+  session.bridge.sendInput(message.data);
+}
+```
+## Error Messages Are UI
+Error messages are read by other agents trying to debug. Make them actionable.
+Every error message should answer three questions:
+1. What went wrong?
+2. What was expected?
+3. What should be done about it?
+```javascript
+// Bad
+throw new Error('Invalid');
+throw new Error('Not found');
+throw new Error('Failed');
+// Good
+throw new Error(`Session '${sessionId}' not found. Available sessions: [${[...sessions.keys()].join(', ')}]`);
+throw new Error(`Bridge '${toolId}' is not available. Run 'which ${command}' to verify installation. Searched paths: ${searchPaths.join(', ')}`);
+throw new Error(`WebSocket message missing required field 'type'. Received: ${JSON.stringify(message)}`);
+```
+## Cross-Platform Landmines
+This codebase runs on both Windows and Linux. Every line of code that touches the filesystem, spawns a process, or handles paths must account for both.
+### Paths
+- ALWAYS use `path.join()`, never string concatenation with `/` or `\\`
+- Use `os.homedir()`, never `process.env.HOME` (undefined on Windows)
+- File paths are case-insensitive on Windows, case-sensitive on Linux
+- Use `path.resolve()` to normalize paths before comparison
+### Process Spawning
+- `where` on Windows, `which` on Linux -- check `process.platform`
+- Windows uses ConPTY, Linux uses standard PTY -- different buffering behavior
+- Executable extensions: `.exe`, `.cmd` on Windows, none on Linux
+- Shell: `cmd.exe` or `powershell.exe` on Windows, `bash` or `sh` on Linux
+### Line Endings
+- Never match output with exact strings -- use `.includes()` or `.trim()`
+- Windows may inject `\r\n` where Linux gives `\n`
+- PTY output may contain ANSI escape sequences -- strip them before comparing
+### The ConPTY Quirks
+- Writes larger than 4096 bytes can overflow the ConPTY buffer on Windows
+- Solution: chunked writes with delays (see `base-bridge.js` chunked write pattern)
+- ConPTY may echo input back -- don't assume output is only from the spawned process
+## Async Safety
+Node.js is async-first. Unhandled promise rejections crash the process.
+Rules:
+- Every `async` function must have try-catch at the top level
+- Every `.then()` chain must have a `.catch()`
+- Event handlers that call async code must wrap in try-catch
+- Use the spawn watchdog pattern from `base-bridge.js`: set a timer when spawning a process, kill it if no output arrives within 30 seconds
+```javascript
+// Bad -- unhandled rejection if startSession throws
+ws.on('message', (data) => {
+  const msg = JSON.parse(data);
+  this.startSession(msg.sessionId);
+});
+// Good
+ws.on('message', (data) => {
+  try {
+    const msg = JSON.parse(data);
+    this.startSession(msg.sessionId).catch(err => {
+      console.error(`Failed to start session ${msg.sessionId}:`, err);
+      this.sendError(wsId, err.message);
+    });
+  } catch (err) {
+    console.error('Failed to parse WebSocket message:', err);
+  }
+});
+```
+## Fail Fast, Fail Loud
+Silent failures are the worst kind. They create bugs that surface hours or sessions later, with no trail.
+- Assert preconditions at function entry -- don't wait until line 50 to discover the input was invalid
+- Log errors with full context before re-throwing: what function, what inputs, what state
+- Never `catch` and silently swallow: `catch (err) { /* ignore */ }` -- this is forbidden
+- If something "shouldn't happen," make it throw, not silently return null
+```javascript
+// Bad -- silent null propagation
+function getSession(id) {
+  return sessions.get(id); // returns undefined silently
+}
+// Good -- fail fast with context
+function getSession(id) {
+  const session = sessions.get(id);
+  if (!session) {
+    throw new Error(`getSession: no session with id '${id}'. Active sessions: ${sessions.size}`);
+  }
+  return session;
+}
+```
+## The "Fresh Machine" Test
+Before considering any code complete, ask yourself: "Would this work on a brand new GitHub Actions runner with nothing pre-installed except Node.js 22?"
+This means:
+- No reliance on globally installed tools (unless you check for them and give a clear error)
+- No hardcoded paths that only exist on your dev machine
+- No cached `node_modules` assumptions -- `npm ci` installs from scratch
+- No file system state left over from previous runs
+- No environment variables that aren't set in CI
+If the answer is "maybe," add a runtime check:
+```javascript
+const commandPath = await this.findCommandAsync();
+if (!commandPath) {
+  throw new Error(
+    `${this.toolName} CLI not found. Searched: ${this.searchPaths.join(', ')}. ` +
+    `Install ${this.toolName} or add it to PATH.`
+  );
+}
+```
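
Editor's sketch (not part of the package diff): the "Line Endings" guidance in the defensive coding guide (strip ANSI sequences, tolerate `\r\n`, compare with `.includes()`) can be illustrated as follows. The helper names `normalizeTerminalOutput` and `outputContains` are hypothetical, and the pattern covers only CSI-style escape sequences such as colors and cursor movement. Other sequence types (OSC titles, for instance) would need additional handling.

```javascript
// Matches CSI escape sequences: ESC [ parameters, intermediates, final byte.
// This is a deliberately narrow pattern, not a complete ANSI parser.
const CSI_PATTERN = /\x1b\[[0-9;?]*[ -\/]*[@-~]/g;

// Strip CSI sequences, convert Windows line endings to "\n", and trim
// surrounding whitespace so both platforms yield comparable text.
function normalizeTerminalOutput(raw) {
  return raw.replace(CSI_PATTERN, '').replace(/\r\n/g, '\n').trim();
}

// Substring comparison on normalized text instead of exact string equality.
function outputContains(raw, expected) {
  return normalizeTerminalOutput(raw).includes(expected);
}

// Illustrative input: green "hello", a reset, and Windows line endings.
const sample = '\x1b[32mhello\x1b[0m\r\nworld\r\n';
console.log(outputContains(sample, 'hello\nworld'));
```

Normalizing once and comparing substrings keeps assertions independent of color codes and platform line endings, the two main sources of Windows-only failures listed in that section.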

package/docs/agent-instructions/06-ci-first-testing.md ADDED

@@ -0,0 +1,268 @@
+# CI-First Testing
+## E2E Tests Are the Source of Truth
+End-to-end tests are the only true way to validate that the system works. Unit tests verify isolated logic. E2E tests prove the whole system -- server, WebSocket, terminal, browser UI -- actually functions as a user would experience it.
+A feature is not done until its E2E tests pass on GitHub runners. If unit tests pass but E2E fails, the feature is broken. Period. No exceptions. No "it works on my machine." The GitHub runner is the only machine that matters.
+Every new feature must have E2E test coverage. Every bug fix must have a regression E2E test. The E2E suite is the contract that tells the next agent "this is what working looks like."
+### Long E2E waits indicate bugs
+If an E2E test requires long waits or generous timeouts to pass, that is a signal of a bug in the product code, not a test timing issue. No real user is going to wait 30 seconds for a terminal to respond or 10 seconds for a WebSocket to connect. If the test needs that much patience, the code is too slow and must be fixed. Tightening test timeouts is a legitimate way to catch performance regressions -- the test should reflect realistic user expectations, not compensate for sluggish code.
+## The Rule: CI Only
+CI is the only authority on whether code works. Never consider a feature done based on local results alone.
+Why:
+- Local environments accumulate stale state, cached modules, and leftover config
+- Native modules like `@lydell/node-pty` may not compile correctly locally
+- Playwright browsers may be outdated or misconfigured locally
+- Local testing only proves it works on one machine, one platform
+- CI runs on both ubuntu-latest AND windows-latest -- that is the real test
+- CI gives fresh, reproducible, cross-platform results every single time
+You may run quick local checks for rapid iteration (e.g., syntax checks, single-file linting), but a feature is not done until CI passes. The GitHub runner is the only environment whose results count.
+## The Workflow
+```
+Write code
+    |
+    v
+Push to branch
+    |
+    v
+Open draft PR (triggers CI automatically)
+    |
+    v
+Wait for CI results (~5-7 minutes)
+    |
+    v
+Read results: all green? --> Done
+    |
+    v (if red)
+Download failure artifacts
+    |
+    v
+Read traces, screenshots, terminal buffers
+    |
+    v
+Fix the issue
+    |
+    v
+Push again --> CI runs again --> repeat until green
+```
+Use `gh pr create --draft` to trigger CI without requesting review. Use `gh run watch` to monitor CI progress from the terminal.
+## CI Job Map
+The CI pipeline is defined in `.github/workflows/ci.yml`. It runs these jobs in parallel, each on both ubuntu-latest and windows-latest:
+| Job | What it tests | Playwright Project | Tests |
+|-----|--------------|-------------------|-------|
+| `test` | Unit tests (Mocha) | N/A | `test/*.test.js` |
+| `test-browser-golden` | Fresh user flow with real CLI | `golden-path` | `01-golden-path.spec.js` |
+| `test-browser-functional-core` | Core terminal features | `functional-core` | `02-terminal-io`, `03-clipboard`, `04-context-menu`, `05-tab-switching` |
+| `test-browser-functional-extended` | Extended features | `functional-extended` | `06-large-paste`, `07-vim-and-session`, `09-image-paste`, `09-background-notifications` |
+| `test-browser-mobile` | Mobile viewport behavior | `mobile-iphone`, `mobile-pixel` | `08-mobile-portrait.spec.js` |
+| `test-browser-visual` | Screenshot regression | `visual-regression` | `09-visual-regression.spec.js` |
+| `test-browser-new-features` | Latest features | `new-features` | `10-command-palette` through `14-nerd-font-rendering` |
+| `build-binary` | SEA binary build + smoke test | N/A | `scripts/smoke-test-binary.js` |
+Total: 16 parallel job executions (8 job types x 2 platforms). All must pass for a green CI.
+### Playwright Project Configuration
+The Playwright config at `e2e/playwright.config.js` defines how test files map to projects:
+- `golden-path` matches `01-golden-path.spec.js`
+- `functional-core` matches `/0[2-5]-.*\.spec\.js/`
+- `functional-extended` matches `/0[6-7]-.*\.spec\.js|09-image-paste\.spec\.js|09-background-.*\.spec\.js/`
+- `mobile-iphone` and `mobile-pixel` both match `08-mobile-portrait.spec.js` (with device-specific viewports)
+- `visual-regression` matches `09-visual-regression.spec.js`
+- `new-features` matches `/1[0-4]-.*\.spec\.js/`
+## Reading CI Failures
+When CI fails:
+1. **Go to the Actions tab** on the PR. Find the failed run.
+2. **Identify the failing job.** Note which platform (ubuntu vs windows).
+3. **Read the job log.** Expand the failed step, look for the error message.
+4. **Download artifacts.** Each browser test job uploads artifacts on failure:
+   - `playwright-{job}-{os}.zip` -- contains test results, screenshots, traces
+   - `screenshot-baselines-{os}` -- visual regression baselines (visual job only)
+   - `screenshot-diffs-{os}` -- visual diff images (visual job only, on failure)
+### What the artifacts contain
+- **Screenshots**: Captured on failure -- shows what the browser actually rendered
+- **Traces**: Playwright trace files -- DOM snapshots, network requests, console logs at each test step (captured on first retry via `trace: 'on-first-retry'`)
+- **Terminal buffer**: The xterm.js buffer content at failure time -- shows what the terminal displayed
+- **WebSocket logs**: Messages exchanged between client and server
+- **Console logs**: Browser console output captured by `setupPageCapture()`
+### Platform-specific failures
+- **Fails on Windows only**: Usually path handling (`\\` vs `/`), shell command differences (`where` vs `which`), ConPTY buffering, or line ending issues
+- **Fails on Linux only**: Usually permission issues, case-sensitive file names, or missing system dependencies
+- **Fails on both**: Real bug in application logic
+## Using Playwright Traces
+Download the trace from CI artifacts and view it:
+```bash
+# Download artifacts (use gh CLI)
+gh run download <run-id> -n playwright-functional-core-ubuntu-latest
+# View trace in browser
+npx playwright show-trace e2e/test-results/path-to-trace.zip
+```
+The trace viewer shows:
+- Step-by-step test execution with timestamps
+- DOM snapshot at each step (inspectable)
+- Network requests and responses
+- Console log entries
+- Screenshots before and after each action
+This is the most powerful debugging tool available. Use it.
+## Check History Before Debugging
+Before investigating any CI failure, check `docs/history/` for known issues and prior solutions. The problem may already be solved. If it's new, document the solution after fixing (see `07-docs-hygiene.md` for format).
+## E2E Tests as Debugging Tools
+E2E tests serve dual purpose: validation and documentation.
+### Understanding expected behavior
+Each spec file demonstrates how a feature should work. Before modifying a feature, read its test first -- it shows the intended behavior more precisely than any spec document.
+### When a test fails, consider both sides
+A failing test means something is wrong, but the bug could live in either place:
+- **Product code bug** -- The code doesn't work as intended. Fix the code, not the test (see ADR-0006).
+- **Test mistake** -- The test has incorrect assertions, wrong selectors, bad timing, or flawed assumptions about expected behavior.
+Always investigate both possibilities before committing a fix. Read the test carefully -- does it actually test the right thing? Then read the product code -- does it actually do what the spec says? Fixing the wrong side creates a false sense of security.
+### Reproducing bugs
+1. Find the closest existing test to the reported behavior
+2. Modify it (or add a new test case) to reproduce the issue
+3. Push to CI -- if the test fails, you have confirmed the bug
+4. Determine whether the bug is in the code or the test
+5. Fix the correct side
+6. Push again -- test should pass, confirming the fix
+### Adding regression tests
+Every bug fix must include an E2E test that would have caught the bug. This prevents regression and documents the fix for future agents.
+## Creating New E2E Tests
+### Naming Convention
+Tests are numbered by category:
+- `01-*` -- Golden path (fresh user flow)
+- `02-05` -- Core functional features (functional-core project)
+- `06-07` -- Extended functional features (functional-extended project)
+- `08-*` -- Mobile-specific
+- `09-*` -- Cross-cutting: `09-image-paste` and `09-background-notifications` in functional-extended, `09-visual-regression` in visual-regression project
+- `10-14` -- New features
+Add new tests with the next available number in the appropriate range. Currently the highest number is `14-nerd-font-rendering.spec.js`.
+### Test Structure
+```javascript
+const { test, expect } = require('@playwright/test');
+const { createServer, createSessionViaApi } = require('../helpers/server-factory');
+const {
+  waitForAppReady,
+  waitForTerminalCanvas,
+  typeInTerminal,
+  waitForTerminalText,
+  setupPageCapture,
+  attachFailureArtifacts,
+  joinSessionAndStartTerminal,
+} = require('../helpers/terminal-helpers');
+test.describe('Feature Name', () => {
+  let server, port, url;
+  test.beforeAll(async () => {
+    ({ server, port, url } = await createServer());
+  });
+  test.afterAll(async () => {
+    if (server) server.close();
+  });
+  test.afterEach(async ({ page }, testInfo) => {
+    await attachFailureArtifacts(page, testInfo);
+  });
+  test('should do the expected thing', async ({ page }) => {
+    setupPageCapture(page);
+    const sessionId = await createSessionViaApi(port, `Test_${Date.now()}`);
+    await page.goto(url);
+    await waitForAppReady(page);
+    await waitForTerminalCanvas(page);
+    await joinSessionAndStartTerminal(page, sessionId);
+    // ... test logic using terminal helpers
+  });
+});
+```
+### Available Helpers
+From `e2e/helpers/terminal-helpers.js`:
+- `waitForAppReady(page)` -- Wait for app to fully initialize
+- `waitForTerminalCanvas(page)` -- Wait for xterm.js container to render
+- `focusTerminal(page)` -- Focus the terminal textarea for keyboard input
+- `typeInTerminal(page, text)` -- Type text into the terminal with per-character delay
+- `pressKey(page, key)` -- Press a key or key combination (e.g. `'Enter'`, `'Control+c'`)
+- `readTerminalContent(page)` -- Read current terminal buffer via xterm.js API
+- `waitForTerminalText(page, text, timeout)` -- Wait for specific text to appear in terminal
+- `getTerminalDimensions(page)` -- Get terminal cols and rows
+- `setupPageCapture(page)` -- Capture WebSocket messages and console logs (call before `page.goto()`)
+- `attachFailureArtifacts(page, testInfo)` -- Attach debug artifacts on test failure (call in `afterEach`)
+- `waitForWebSocket(page)` -- Wait for WebSocket connection to be open
+- `joinSessionAndStartTerminal(page, sessionId)` -- Full session setup: join session and start terminal tool
+From `e2e/helpers/server-factory.js`:
+- `createServer()` -- Start a test server instance, returns `{ server, port, url }`
+- `createSessionViaApi(port, name)` -- Create a session via REST API, returns sessionId
+### Registering in Playwright Config
+Add new tests to the appropriate project in `e2e/playwright.config.js` by updating the `testMatch` pattern. Then update the corresponding CI job in `.github/workflows/ci.yml` if the new test does not already match an existing project regex.
+New feature tests numbered 10-14 automatically match the `new-features` project regex `/1[0-4]-.*\.spec\.js/`. If you need number 15 or higher, update the regex.
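A quick way to see which project a new spec file would land in is to test its name against the project patterns. The sketch below copies the three regexes that appear in this diff; the `matchingProjects` helper and the `15-example-feature.spec.js` file name are hypothetical and only serve as illustration. Other projects in the config are left out.

```javascript
// Project regexes copied from e2e/playwright.config.js and the text above.
const projectPatterns = {
  'functional-core': /0[2-5]-.*\.spec\.js/,
  'functional-extended': /0[6-7]-.*\.spec\.js|09-image-paste\.spec\.js|09-background-.*\.spec\.js/,
  'new-features': /1[0-4]-.*\.spec\.js/,
};

// Hypothetical helper: list every project whose pattern matches a spec file name.
function matchingProjects(fileName) {
  return Object.entries(projectPatterns)
    .filter(([, pattern]) => pattern.test(fileName))
    .map(([project]) => project);
}

console.log(matchingProjects('14-nerd-font-rendering.spec.js')); // [ 'new-features' ]
console.log(matchingProjects('15-example-feature.spec.js'));     // []
```

An empty result for a new file name is the signal that a `testMatch` pattern needs updating before the test will run in CI.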
+## Performance Budget
+No single CI job may take more than 7 minutes. This is a hard limit.
+Fast CI feedback is critical for the push-fix-push workflow. If a job exceeds 7 minutes:
+1. Check if the job has too many tests -- split into sub-groups
+2. Check for tests with excessive waits or timeouts that could be tightened
+3. Consider splitting the job into multiple CI matrix entries
+4. Open an issue to track and fix the performance regression
+Monitor job times after adding new E2E tests. Growth is expected, but the 7-minute budget must hold.
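As a rough illustration of monitoring the budget, the sketch below flags jobs that exceed 7 minutes. The `jobsOverBudget` helper, the input shape, the second job name, and the sample durations are all assumptions made up for the example; they are not measurements from this repository's CI.

```javascript
// The 7-minute budget from the text, expressed in seconds.
const BUDGET_SECONDS = 7 * 60;

// Hypothetical helper: return the names of jobs whose duration exceeds the budget.
// `jobs` uses an assumed shape: [{ name: string, durationSeconds: number }].
function jobsOverBudget(jobs, budgetSeconds = BUDGET_SECONDS) {
  return jobs
    .filter((job) => job.durationSeconds > budgetSeconds)
    .map((job) => job.name);
}

// Sample durations are invented for illustration only.
const sampleJobs = [
  { name: 'test-browser-functional-core', durationSeconds: 380 },
  { name: 'example-slow-job', durationSeconds: 455 },
];
console.log(jobsOverBudget(sampleJobs)); // [ 'example-slow-job' ]
```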

package/docs/agent-instructions/07-docs-hygiene.md ADDED

@@ -0,0 +1,124 @@
+# Documentation Hygiene
+## The Spec-Code Contract
+Every component in this codebase has a specification in `docs/specs/`. This is a binding contract:
+- If behavior changes, the spec MUST be updated in the same commit. Not the next commit. Not the next PR. The same commit.
+- If the spec says X and the code says Y, the code is wrong -- until the spec is deliberately updated.
+- Pull requests that change behavior without updating specs are incomplete and should not be merged.
+This is not bureaucracy. This is how agents that don't share memory stay in sync. The spec is the source of truth that persists across sessions.
+## When to Update What
+| You did this | Update this |
+|---|---|
+| Added a new feature | Write or update spec in `docs/specs/` + write ADR if architectural decision was made |
+| Fixed a bug | Update spec if behavior changed + add entry to `docs/history/` with root cause and fix |
+| Refactored code | Write ADR if pattern changed + update spec if API surface changed |
+| Added a dependency | Write ADR with research findings (version, license, CVE check, alternatives considered) |
+| Changed the CI pipeline | Update `docs/agent-instructions/03-tooling-and-pipelines.md` and `06-ci-first-testing.md` |
+| Changed WebSocket protocol | Update `docs/architecture/websocket-protocol.md` + update server spec |
+| Added a new bridge | Update `docs/specs/bridges.md` + update `docs/architecture/bridge-pattern.md` |
+| Changed E2E test structure | Update `docs/specs/e2e-testing.md` + update `06-ci-first-testing.md` CI job map |
+When in doubt: update the docs. Over-documentation is always better than under-documentation in an AI-agent-driven codebase.
+## ADR Lifecycle
+Architecture Decision Records are permanent artifacts. They capture the context, reasoning, and trade-offs of a decision at the time it was made.
+### Creating a new ADR
+- Use the template at `docs/adrs/0000-template.md`
+- Number sequentially: find the highest existing number and increment
+- Status: "Accepted" with today's date
+- Include: Context (why this decision was needed), Decision (what was chosen), Consequences (positive and negative)
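The sequential numbering rule can be sketched as follows, assuming ADR files follow the four-digit `NNNN-title.md` naming seen in this diff (for example `0008-e2e-parallelization.md`). The `nextAdrNumber` helper is hypothetical and not part of the repository.

```javascript
// Hypothetical helper: find the highest existing ADR number and increment it.
// Assumes file names start with a four-digit number followed by a hyphen.
function nextAdrNumber(fileNames) {
  const numbers = fileNames
    .map((name) => /^(\d{4})-.*\.md$/.exec(name))
    .filter(Boolean)
    .map((match) => Number(match[1]));
  const highest = numbers.length > 0 ? Math.max(...numbers) : 0;
  return String(highest + 1).padStart(4, '0');
}

console.log(nextAdrNumber(['0000-template.md', '0008-e2e-parallelization.md'])); // 0009
```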
+### Changing a decision
+Never edit the content of an accepted ADR. The original context and reasoning are historically valuable. The only permitted change is the supersession note described in step 3 below.
+Instead:
+1. Create a new ADR that supersedes the old one
+2. In the new ADR, reference the old one: "Supersedes ADR-XXXX"
+3. In the old ADR, add a note: "Superseded by ADR-YYYY" with the date
+4. Keep the old ADR's original content intact
+### When an ADR is required
+- Choosing between architectural approaches (e.g., ADR-0001: bridge base class)
+- Adding or removing dependencies (e.g., ADR-0002: devtunnels over ngrok)
+- Changing system topology (e.g., ADR-0003: multi-tool architecture)
+- Platform-specific decisions (e.g., ADR-0004: cross-platform support)
+- Distribution changes (e.g., ADR-0005: single binary distribution)
+- Process decisions (e.g., ADR-0006: test-driven bug fixes)
+## History as Institutional Memory
+`docs/history/` is the most important directory for autonomous AI agents. It's where lessons live.
+LLMs don't carry memories between sessions. Every new session starts from zero context. The ONLY way to learn from past mistakes, debugging sessions, and hard-won insights is to write them down in `docs/history/`.
+### What to document
+- Non-trivial bug fixes (especially platform-specific ones)
+- CI failure patterns and their solutions
+- Cross-platform gotchas discovered during development
+- Debugging sessions that took significant effort
+- Performance issues and how they were resolved
+- Dependency conflicts and their resolutions
+### Format
+File name: `YYYY-MM-DD-short-description.md`
+Content structure:
+```markdown
+# Short Description
+## What Happened
+[The symptom or error observed. Include error messages, CI job names, platforms affected.]
+## Root Cause
+[What actually caused the issue. Be specific -- which file, which line, which assumption was wrong.]
+## Fix
+[What was changed and why. Reference commit hashes or PR numbers.]
+## Watch For
+[Conditions that might trigger the same issue again. What future agents should be careful about.]
+```
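A small sketch of generating a file name in the `YYYY-MM-DD-short-description.md` format. The `historyFileName` helper is hypothetical, and lowercasing the description into a hyphen-separated slug is an assumption about what "short-description" should look like; the sample description is invented.

```javascript
// Hypothetical helper: build a docs/history/ file name from a date and a
// free-text description. The date part uses the UTC calendar date.
function historyFileName(date, description) {
  const day = date.toISOString().slice(0, 10);
  const slug = description
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything non-alphanumeric into one hyphen
    .replace(/^-+|-+$/g, '');    // trim leading and trailing hyphens
  return `${day}-${slug}.md`;
}

console.log(historyFileName(new Date(Date.UTC(2025, 0, 15)), 'Windows PTY resize race'));
// 2025-01-15-windows-pty-resize-race.md
```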
+### The rule
+Before debugging any failure, check `docs/history/` first. If the problem has been solved before, the answer is already there. If it hasn't, document your solution after fixing it.
+A solved problem that isn't documented is a problem that will be solved again.
+## Stale Docs Are Bugs
+Outdated documentation is not a low-priority cleanup task. It's a bug. It actively misleads the next agent, causing incorrect implementations, wasted CI cycles, and rework.
+Treat stale docs with the same urgency as a failing test:
+- If you notice a spec that doesn't match current behavior, update it immediately
+- If you find an ADR that references deleted code, note it
+- If a history entry has incorrect information, correct it
+- If agent instructions reference outdated patterns, fix them
+## Pre-Commit Documentation Checklist
+Before every commit, ask yourself these 6 questions:
+1. **Did I change behavior?** -- Update the relevant spec in `docs/specs/`
+2. **Did I make an architectural decision?** -- Write an ADR in `docs/adrs/`
+3. **Did I fix a bug?** -- Add a history entry in `docs/history/`
+4. **Did I solve a non-obvious problem?** -- Add a history entry in `docs/history/`
+5. **Did I change an API surface?** -- Update method signatures in the spec
+6. **Did I introduce a new pattern?** -- Document it in `docs/architecture/`
+If the answer to any of these is yes and you haven't updated docs, your commit is incomplete.
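The checklist can also be read as a lookup from answers to required documentation actions. The sketch below encodes the six questions as data; the `pendingDocUpdates` helper and its boolean-array input are assumptions for illustration, not an existing tool in the repository.

```javascript
// The six checklist questions and their follow-up actions, copied from the list above.
const checklist = [
  { question: 'Did I change behavior?', action: 'Update the relevant spec in docs/specs/' },
  { question: 'Did I make an architectural decision?', action: 'Write an ADR in docs/adrs/' },
  { question: 'Did I fix a bug?', action: 'Add a history entry in docs/history/' },
  { question: 'Did I solve a non-obvious problem?', action: 'Add a history entry in docs/history/' },
  { question: 'Did I change an API surface?', action: 'Update method signatures in the spec' },
  { question: 'Did I introduce a new pattern?', action: 'Document it in docs/architecture/' },
];

// Hypothetical helper: `answers` is an array of six booleans, one per question.
// Returns the distinct documentation actions still required.
function pendingDocUpdates(answers) {
  const actions = checklist
    .filter((_, index) => answers[index] === true)
    .map((item) => item.action);
  return [...new Set(actions)];
}

console.log(pendingDocUpdates([true, false, true, true, false, false]));
```

Questions 3 and 4 lead to the same action, so the result is de-duplicated.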

package/docs/agent-instructions/08-multi-agent-consultation.md ADDED

@@ -0,0 +1,168 @@
+# Multi-Agent Consultation
+## When In Doubt, Consult
+Don't guess at architecture. Don't guess at testing strategy. Don't guess at UX. Don't guess at requirements.
+When facing a decision that could go multiple ways, spawn specialized subagents to get expert perspectives before committing to an approach. The cost of consulting is a few minutes. The cost of guessing wrong is hours of rework, broken CI, and confused future agents.
+This is not optional for significant decisions. This is how a well-run engineering org operates -- you get input from experts before making calls that affect the whole system.
+If your runtime does not support spawning subagents, adopt the expert role yourself: explicitly state "Thinking as a Principal Engineer..." and reason from that perspective before proceeding. The goal is the expert thinking, not the subagent mechanism.
+## Available Expert Roles
+The expert roles below extend the 5 team agents (Architect, Engineer, QA Reviewer, Troubleshooter, Researcher) with senior perspectives; the Architect appears in both lists. Consult them when the situation calls for it:
+### Architect
+**When to consult:** System design, component boundaries, data flow, protocol changes.
+**Ask for:** Design review, alternative approaches, risk assessment.
+### Principal Engineer
+**When to consult:** Deep technical decisions, performance trade-offs, system reliability, concurrency issues, platform-specific behavior.
+**Ask for:** Technical feasibility assessment, performance implications, edge case analysis.
+### Lead QA
+**When to consult:** Test strategy, coverage gaps, regression risk, E2E test design, CI pipeline changes.
+**Ask for:** Test plan review, risk assessment, coverage recommendations.
+### Principal Program Manager
+**When to consult:** Requirements clarity, scope decisions, feature prioritization, user-facing changes, backwards compatibility.
+**Ask for:** Requirements validation, scope check, impact analysis.
+### Designer
+**When to consult:** UI/UX decisions, interaction patterns, accessibility, visual consistency, mobile behavior.
+**Ask for:** Interaction review, accessibility audit, visual consistency check.
+### Lead User Researcher
+**When to consult:** User impact assessment, usability concerns, workflow analysis, onboarding experience.
+**Ask for:** User impact assessment, usability review, workflow validation.
+## Parallel Consultation
+When facing a complex decision, spawn multiple expert subagents in parallel. Don't consult one at a time -- that wastes time.
+### Example: Changing the WebSocket protocol
+This affects architecture, implementation, testing, and user experience. Consult simultaneously:
+- **Architect** -- Is the protocol change consistent with existing patterns? What are the migration concerns?
+- **Principal Engineer** -- What are the performance implications? Are there concurrency edge cases?
+- **Lead QA** -- What tests need to change? What regression risks exist?
+All three can run in parallel and return independent assessments.
+### Example: Adding a new UI component
+- **Designer** -- Does it fit the existing design language? Is it accessible?
+- **Lead User Researcher** -- Will users understand it? Does it fit the workflow?
+- **Engineer** -- What's the implementation approach? What existing patterns apply?
+### Example: Debugging a platform-specific CI failure
+- **Troubleshooter** -- What's the root cause? What's the minimal fix?
+- **Principal Engineer** -- Is there a deeper architectural issue? Will this recur?
+- **Lead QA** -- What test coverage is missing? How do we prevent regression?
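The parallel pattern in the examples above can be sketched with `Promise.all`. The `consult` function here is a hypothetical stand-in for whatever subagent mechanism the runtime provides; it returns a placeholder so the sketch runs on its own. The point is that all requests start before any result is awaited.

```javascript
// Hypothetical stand-in for spawning a subagent. A real runtime would send the
// question to an expert agent; this stub resolves with a placeholder instead.
async function consult(role, question) {
  return { role, assessment: `(${role} response to: ${question})` };
}

// Start every consultation at once and wait for all of them together,
// rather than awaiting each expert one after another.
async function consultInParallel(question, roles) {
  return Promise.all(roles.map((role) => consult(role, question)));
}

consultInParallel('Should the WebSocket protocol change?', [
  'Architect',
  'Principal Engineer',
  'Lead QA',
]).then((results) => {
  for (const { role, assessment } of results) {
    console.log(`${role}: ${assessment}`);
  }
});
```

`Promise.all` preserves input order, so each assessment can be attributed to its role regardless of which subagent finishes first.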
+## How to Frame a Consultation
+Give each subagent full context. A vague question gets a vague answer.
+### What to include in every consultation request
+1. **What you're trying to do** -- The goal, not just the task
+2. **What you've considered** -- Options you've thought about and why you're unsure
+3. **What constraints exist** -- Cross-platform requirements, performance budgets, backwards compatibility needs
+4. **What you need back** -- A specific deliverable: recommendation, risk assessment, alternative approaches, code review
+### Good consultation prompt
+```
+I need to add chunked file upload support to the WebSocket protocol.
+Context: Currently image uploads send the entire base64 payload in one message.
+For files over 1MB this causes WebSocket frame size issues on some browsers.
+Options I'm considering:
+1. Split into multiple WebSocket messages with sequence numbers
+2. Use a separate HTTP upload endpoint
+3. Use WebSocket binary frames with streaming
+Constraints:
+- Must work on both desktop and mobile browsers
+- Must not break existing image paste flow
+- Server must handle concurrent uploads from multiple sessions
+Please assess each option for: implementation complexity, reliability,
+cross-browser compatibility, and impact on existing code.
+```
+### Bad consultation prompt
+```
+How should I handle file uploads?
+```
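The four elements listed above can be assembled programmatically so that a request missing any of them is rejected before it is sent. This is a minimal sketch; the `buildConsultationPrompt` helper and its field names are hypothetical.

```javascript
// Hypothetical helper: assemble a consultation request from the four elements
// (goal, options considered, constraints, requested deliverable). Throws when
// any element is missing so vague requests are caught early.
function buildConsultationPrompt({ goal, considered, constraints, deliverable }) {
  if (!goal || !deliverable || !considered?.length || !constraints?.length) {
    throw new Error('A consultation request needs a goal, options, constraints, and a deliverable');
  }
  return [
    goal,
    "Options I'm considering:",
    ...considered.map((option, index) => `${index + 1}. ${option}`),
    'Constraints:',
    ...constraints.map((constraint) => `- ${constraint}`),
    deliverable,
  ].join('\n');
}

console.log(buildConsultationPrompt({
  goal: 'I need to add chunked file upload support to the WebSocket protocol.',
  considered: ['Split into multiple WebSocket messages with sequence numbers', 'Use a separate HTTP upload endpoint'],
  constraints: ['Must not break existing image paste flow'],
  deliverable: 'Please assess each option for implementation complexity and reliability.',
}));
```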
+## When to Consult
+Always consult for:
+- **Architectural changes** -- New patterns, component restructuring, protocol changes
+- **Breaking API changes** -- WebSocket message format, REST endpoint changes
+- **New dependencies** -- Any npm package addition (Researcher for vetting, Architect for fit)
+- **UX-visible changes** -- Anything a user would notice (Designer + Lead User Researcher)
+- **Test strategy changes** -- New testing patterns, CI pipeline changes (Lead QA)
+- **Performance-critical code** -- Anything in the hot path (Principal Engineer)
+- **Security-sensitive code** -- Auth, input validation, path traversal (Principal Engineer + QA)
+Skip consultation for:
+- Typo fixes
+- Comment updates
+- Straightforward bug fixes where the root cause is clear
+- Documentation-only changes
+## Synthesizing Advice
+When experts disagree (and they will), handle it systematically:
+1. **Document the disagreement** -- What does each expert recommend and why?
+2. **Identify the core tension** -- Is it between performance and simplicity? Between speed and correctness?
+3. **Make a decision** -- You can't wait for consensus. Weigh the arguments and choose.
+4. **Record it in an ADR** -- Document what was decided, what alternatives were considered, and why you chose this path.
+5. **Move forward** -- Don't second-guess. If the decision proves wrong later, a future agent can write a new ADR that supersedes this one.
+Disagreement between experts is a signal that the decision is important, not that it's impossible.
+## Post-Completion Review Is Mandatory
+After completing any non-trivial work, spawn a reviewer subagent to review what you did before considering the task done. This is not optional.
+Self-review is unreliable -- the same blind spots that led to mistakes in implementation will exist during self-review. An independent reviewer subagent operates with fresh context and catches issues you missed.
+### What the reviewer should check
+- Code correctness and edge cases
+- Cross-platform compatibility (Windows + Linux)
+- Test coverage completeness
+- Documentation updates (specs, ADRs, history)
+- Adherence to coding conventions
+- Security concerns (input validation, path traversal, injection)
+- Performance implications
+### How to run the review
+Spawn a QA Reviewer or Lead QA subagent with:
+1. A summary of what was changed and why
+2. The list of files modified
+3. The relevant spec and ADR references
+4. A request to verify: correctness, test coverage, doc completeness, cross-platform safety
+The reviewer's findings should be addressed before marking the task as done. If the reviewer identifies issues, fix them and re-review. No work is complete until it has been independently reviewed.

package/e2e/playwright.config.js CHANGED

@@ -6,7 +6,7 @@ module.exports = defineConfig({
   fullyParallel: false,
   forbidOnly: !!process.env.CI,
   retries: process.env.CI ? 1 : 0,
-  workers: 1,
+  workers: process.env.CI ? 2 : 1,
   timeout: 60000,
   expect: {
     timeout: 15000,
@@ -32,8 +32,12 @@ module.exports = defineConfig({
       testMatch: '01-golden-path.spec.js',
     },
     {
-      name: 'functional',
-      testMatch: /0[2-7]-.*\.spec\.js|09-image-paste\.spec\.js|09-background-.*\.spec\.js/,
+      name: 'functional-core',
+      testMatch: /0[2-5]-.*\.spec\.js/,
+    },
+    {
+      name: 'functional-extended',
+      testMatch: /0[6-7]-.*\.spec\.js|09-image-paste\.spec\.js|09-background-.*\.spec\.js/,
     },
     {
       name: 'mobile-iphone',

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "ai-or-die",
-  "version": "0.1.22",
+  "version": "0.1.23",
   "description": "Universal AI coding terminal — Claude, Copilot, Gemini & more in your browser",
   "main": "src/server.js",
   "bin": {