@callumvass/forgeflow-dev 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json ADDED
@@ -0,0 +1,42 @@
{
  "name": "@callumvass/forgeflow-dev",
  "version": "0.1.0",
  "type": "module",
  "description": "Dev pipeline for Pi — TDD implementation, code review, architecture, and skill discovery.",
  "keywords": [
    "pi-package"
  ],
  "license": "MIT",
  "repository": {
    "type": "git",
    "url": "git+https://github.com/callumvass/forgeflow.git",
    "directory": "packages/dev"
  },
  "publishConfig": {
    "provenance": true
  },
  "pi": {
    "extensions": [
      "./extensions"
    ],
    "skills": [
      "./skills"
    ],
    "agents": [
      "./agents"
    ]
  },
  "scripts": {
    "build": "tsup"
  },
  "dependencies": {
    "@callumvass/forgeflow-shared": "*"
  },
  "peerDependencies": {
    "@mariozechner/pi-ai": "*",
    "@mariozechner/pi-agent-core": "*",
    "@mariozechner/pi-coding-agent": "*",
    "@mariozechner/pi-tui": "*",
    "@sinclair/typebox": "*"
  }
}
@@ -0,0 +1,119 @@
---
name: code-review
description: Structured, checklist-driven code review with confidence scoring and evidence requirements. Precision over recall.
---

# Code Review Skill

## Review Order

Always review in this order. Check each category completely before moving to the next.

### 1. Logic & Correctness
- Business logic errors (wrong conditions, missing branches)
- Control flow bugs (off-by-one, infinite loops, unreachable code)
- Wrong return values or incorrect transformations
- State management bugs (stale state, missing updates, race conditions)
- Dead wiring: new modules/classes only imported in test files, never called from production code

### 2. Security
- Injection vulnerabilities (SQL, XSS, command injection)
- Auth/authz bypass (missing checks, privilege escalation)
- Secrets exposure (hardcoded keys, tokens in logs)
- Unsafe deserialization or eval usage

### 3. Error Handling
- Unhandled null/undefined that will crash at runtime
- Missing error paths that silently fail
- Swallowed exceptions hiding real failures
- Error messages leaking internal details

### 4. Performance
- N+1 queries or unbounded loops over data
- Missing pagination on unbounded result sets
- Memory leaks (event listeners, subscriptions not cleaned up)
- Unnecessary re-renders or recomputation in hot paths

### 5. Test Quality
- TDD compliance: tests verify behavior through public interfaces, not implementation
- Boundary coverage: mocked boundaries have corresponding integration/contract tests
- Mock fidelity: mocks encode correct assumptions about external systems
- Missing tests for new behavior paths

## Evidence Requirements

Every finding MUST include:
- **File path and line number(s)** — exact location
- **Code snippet** — the problematic code, quoted verbatim
- **Explanation** — why this is wrong (not "could be better", but "this WILL cause X")
- **Suggested fix** — concrete, not vague

Findings without evidence are invalid and will be rejected by the review judge.

## Confidence Scoring

Rate each finding 0-100:
- **< 50**: Do not report. Likely a false positive.
- **50-84**: Do not report. Plausible, but below the reporting threshold.
- **85-94**: Report. High confidence this is a real issue.
- **95-100**: Report. Certain. Evidence directly confirms.

**Threshold: only report findings with confidence >= 85.**

## Severity Levels

- **critical**: Will cause a bug, security vulnerability, data loss, or crash in production. Must fix before merge.
- **major**: Significant logic error, missing error handling that will affect users. Must fix.
- **minor**: Code quality issue, edge case gap, suboptimal pattern. Fix and merge.
- **nit**: Style preference, naming suggestion, trivial improvement. Author's discretion.

Guidelines:
- Security findings are always `critical`.
- Test Quality findings are `minor` unless they mask a real bug.

## FINDINGS Output Format

```markdown
## Review: [scope description]

### Finding 1
- **Confidence**: [85-100]
- **Severity**: [critical | major | minor | nit]
- **Category**: [Logic | Security | Error Handling | Performance | Test Quality]
- **File**: path/to/file.ts:42
- **Code**: `the problematic code`
- **Issue**: [clear explanation of what's wrong and what will happen]
- **Fix**: [concrete suggestion]

### Finding 2
...

## Summary
- Total findings: N
- Categories: [breakdown]
- Overall assessment: [one sentence]
```

If no findings meet the confidence threshold:

```markdown
## Review: [scope description]

No issues found above confidence threshold (85).

## Summary
- Reviewed: [what was checked]
- Overall assessment: Code meets standards.
```

## Anti-Patterns — Do NOT Flag

- **Naming/formatting** — linters handle this
- **Style preferences** — subjective choices
- **Theoretical edge cases** — "what if X is null?" when X is guaranteed non-null
- **Architectural suggestions** — out of scope for PR review
- **"You could also..." suggestions** — if it's not broken, don't suggest alternatives
- **Over-engineering suggestions** — "add error handling for..." when the error can't happen
- **Pre-existing issues** — problems that existed before this PR
- **Missing features** — unless it's in the acceptance criteria
- **Documentation gaps** — unless the code is genuinely incomprehensible
@@ -0,0 +1,58 @@
---
name: plugins
description: Domain-specific plugin router. Scans project plugins, matches triggers against codebase/diff, returns which plugins to load for a given pipeline stage.
---

# Plugins

Domain-specific knowledge that is progressively loaded based on the project's codebase and the current pipeline stage. Plugins live in the project repository, not in forgeflow.

## Plugin Location

Plugins are installed per-project at:

```
<repo-root>/.forgeflow/plugins/<name>/PLUGIN.md
```

Use `/discover-skills` to find and install plugins for your project's tech stack.

## Plugin Structure

```
.forgeflow/
  plugins/
    <name>/
      PLUGIN.md      # Triggers + stage-specific guidance (always read when matched)
      references/    # Deep context (read lazily when needed)
        *.md
```

Each `PLUGIN.md` has YAML frontmatter with trigger conditions and stage applicability:

```yaml
---
name: Human-readable name
description: One-line description
triggers:
  files: ["*.tsx", "*.jsx"]     # Glob patterns for project files
  content: ["useQuery", "cn("]  # Literal strings to search for in codebase/diff
stages: [plan, implement, review, refactor, architecture] # Which pipeline stages use this plugin
---
```

## How to Match Plugins

Given a diff or codebase, scan each plugin at `<cwd>/.forgeflow/plugins/*/PLUGIN.md`:

1. **files**: At least one file (changed or in codebase) matches any of the plugin's file glob patterns.
2. **content**: At least one of the plugin's content strings appears in the diff or codebase.
3. **stages**: The current pipeline stage is listed in the plugin's `stages` array.

All three conditions must be true for a plugin to match.
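The three conditions can be sketched in TypeScript roughly as follows. This is a hedged illustration, not forgeflow's actual code: `PluginMeta`, `matchesPlugin`, and the deliberately minimal `*`-only glob handling are assumptions made for the example.

```typescript
// PluginMeta mirrors the PLUGIN.md frontmatter shown above.
interface PluginMeta {
  name: string;
  triggers: { files: string[]; content: string[] };
  stages: string[];
}

// Minimal glob support: only `*` wildcards, enough for patterns like "*.tsx".
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*/g, ".*"); // then expand the glob wildcard
  return new RegExp(`^${escaped}$`);
}

function matchesPlugin(
  plugin: PluginMeta,
  changedFiles: string[],
  diffText: string,
  stage: string,
): boolean {
  // 1. files: any changed file's basename matches any glob pattern
  const fileHit = plugin.triggers.files.some((glob) => {
    const re = globToRegExp(glob);
    return changedFiles.some((f) => re.test(f.split("/").pop() ?? f));
  });
  // 2. content: any literal trigger string appears in the diff/codebase text
  const contentHit = plugin.triggers.content.some((s) => diffText.includes(s));
  // 3. stages: the current pipeline stage is declared by the plugin
  const stageHit = plugin.stages.includes(stage);
  return fileHit && contentHit && stageHit; // all three conditions must hold
}
```

The stage check is what keeps, say, an architecture-only plugin out of the implement stage even when its file and content triggers fire.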

## Progressive Disclosure Layers

1. **Trigger scan** — read only frontmatter, decide which plugins match. Cheap.
2. **Plugin body** — read the matched `PLUGIN.md` body for stage-specific guidance. Medium cost.
3. **Plugin references** — read files from `references/` only when deeper context is needed. Expensive, on-demand only.
@@ -0,0 +1,46 @@
---
name: stitch
description: Stitch UI design reference. DESIGN.md is the styling authority, Stitch screens are the layout authority. Use when implementing UI with design system integration.
---

# Stitch — UI Design Reference

Stitch is the source of truth for UI design. It provides two things:

1. **DESIGN.md** — a design system document in the project root. Contains colors, typography, components, spacing, do's/don'ts. This is your styling bible.
2. **Screens** — HTML mockups stored in a Stitch project. Each screen is a pixel-perfect reference for a specific page or component.

## How Stitch Connects to the Project

- **DESIGN.md** lives in the project root. If it doesn't exist, Stitch is not in use — skip all Stitch workflows.
- **Stitch project ID** is specified in the PRD or issue (not in DESIGN.md).

## Workflow: Implementing UI

### 1. Read the Design System

Read `DESIGN.md` first. Every visual decision (colors, fonts, spacing, elevation, component patterns) must follow its rules.

### 2. Get Screen References

If the issue body contains embedded screen HTML, use that as your layout reference. If not, and the project has Stitch MCP tools available, fetch screens using the project ID.

### 3. Implement to Match

For each screen relevant to your work:
1. Use the HTML as your **exact visual target**.
2. Implement the component to match the HTML structure, styling, and layout.
3. Adapt to the project's framework (React, Vue, etc.), but the visual output must match.

### 4. Generate Missing Screens

When implementing a UI component that has **no matching screen**: describe the component in the issue or consult DESIGN.md for patterns. If Stitch MCP tools are available, generate a screen reference.

## Rules

- **DESIGN.md is the styling authority.** All visual decisions come from DESIGN.md.
- **Screen HTML is the layout authority.** The visual output must match exactly.
- **Copy Stitch classes verbatim.** Stitch HTML uses Tailwind classes. Use the exact same classes — do NOT translate to inline styles, CSS modules, or custom CSS. Inline styles lose hover states, opacity modifiers, and responsive breakpoints.
- **Configure Tailwind theme first.** Stitch HTML relies on custom theme colors (e.g., `bg-primary/20`). Ensure the Tailwind config defines all design system colors from DESIGN.md before implementing components.
- **No custom CSS.** Use Tailwind exclusively. If you need a Stitch class that doesn't resolve, fix the Tailwind config — don't replace the class.
- **Don't deviate from the design.** If the design conflicts with requirements, flag it — don't silently "improve" it.
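The "configure Tailwind theme first" rule amounts to a config along these lines. This is a hedged sketch: the color names and hex values are hypothetical stand-ins for whatever DESIGN.md actually defines.

```typescript
// tailwind.config.ts — illustrative only; replace the placeholder color
// values below with the actual tokens from DESIGN.md.
import type { Config } from "tailwindcss";

export default {
  content: ["./src/**/*.{ts,tsx,html}"],
  theme: {
    extend: {
      colors: {
        // Defining `primary` here is what makes Stitch classes such as
        // `bg-primary/20` (the 20% opacity modifier) resolve correctly.
        primary: "#2563eb",
        surface: "#0f172a",
      },
    },
  },
} satisfies Config;
```

With the theme in place, Stitch's classes can be copied verbatim and the opacity and responsive variants keep working.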
@@ -0,0 +1,115 @@
---
name: tdd
description: Test-driven development with red-green-refactor loop. Use when building features or fixing bugs using TDD, mentions "red-green-refactor", wants integration tests, or asks for test-first development.
---

# Test-Driven Development

## Philosophy

Examples in this skill use framework-neutral pseudocode. Translate test syntax/assertions to the project's language and test runner.

**Core principle**: Test at system boundaries, not internal modules. Mock only what you don't control.

Every system has two testable boundaries:
1. **Server/backend boundary** — test through the real runtime or framework test harness (HTTP handlers, message handlers, queue consumers). Use real storage, real state.
2. **Client/frontend boundary** — test at the route/page level. Mock the network edge (HTTP/WebSocket), but render real components with real stores and real hooks.

Internal modules (stores, hooks, services, helpers) get covered transitively by boundary tests. Don't test them separately — if a store has a bug, a route-level test that exercises the same behavior will catch it.

**Unit test only pure algorithmic functions** where the math matters (rounding, scoring, splitting, validation). Everything else goes through a boundary.

**Good tests** are integration-style: they exercise real code paths through public APIs. They describe _what_ the system does, not _how_ it does it.

**Bad tests** are coupled to implementation. They mock internal collaborators, test private methods, or verify through external means. The warning sign: your test breaks when you refactor, but behavior hasn't changed.

See [tests.md](tests.md) for boundary examples and what not to test, and [mocking.md](mocking.md) for mocking guidelines.

## Anti-Pattern: Horizontal Slices

**DO NOT write all tests first, then all implementation.** This is "horizontal slicing" — treating RED as "write all tests" and GREEN as "write all code."

**Correct approach**: Vertical slices via tracer bullets. One test → one implementation → repeat. Each test responds to what you learned from the previous cycle.

```
WRONG (horizontal):
  RED:   test1, test2, test3, test4, test5
  GREEN: impl1, impl2, impl3, impl4, impl5

RIGHT (vertical):
  RED→GREEN: test1→impl1
  RED→GREEN: test2→impl2
  RED→GREEN: test3→impl3
  ...
```

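One vertical slice might look like this in TypeScript. This is purely illustrative (`cartTotal` and its test are hypothetical); the skill's own examples stay framework-neutral.

```typescript
// RED (written first): one test for one behavior, via the public interface.
// GREEN: the minimal implementation below makes it pass, and nothing more:
// no discounts, no currency handling, nothing the current test doesn't demand.

function cartTotal(items: { price: number; qty: number }[]): number {
  return items.reduce((sum, item) => sum + item.price * item.qty, 0);
}

// The test that drove this slice:
const total = cartTotal([
  { price: 5, qty: 2 },
  { price: 3, qty: 1 },
]);
if (total !== 13) throw new Error(`expected 13, got ${total}`);
// The next RED→GREEN cycle would add the next behavior (say, an empty
// cart), not a batch of speculative tests.
```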
## Workflow

### 1. Planning

Before writing any code:

- Confirm what interface changes are needed
- Confirm which behaviors to test (prioritize)
- Identify opportunities for [deep modules](deep-modules.md) (small interface, deep implementation)
- Design interfaces for [testability](interface-design.md)
- List the behaviors to test (not implementation steps)

**You can't test everything.** Focus testing effort on critical paths and complex logic, not every possible edge case.

### 2. Tracer Bullet

Write ONE test that confirms ONE thing about the system:

```
RED:   Write test for first behavior → test fails
GREEN: Write minimal code to pass → test passes
```

This is your tracer bullet — it proves the path works end-to-end.

### 3. Incremental Loop

For each remaining behavior:

```
RED:   Write next test → fails
GREEN: Minimal code to pass → passes
```

Rules:
- One test at a time
- Only enough code to pass the current test
- Don't anticipate future tests
- Keep tests focused on observable behavior

### 4. Refactor

After all tests pass, look for [refactor candidates](refactoring.md):
- Extract duplication
- Deepen modules (move complexity behind simple interfaces)
- Apply SOLID principles where natural
- Run tests after each refactor step

**Never refactor while RED.** Get to GREEN first.

### 5. Boundary Verification

After all unit tests pass, ask: **"Did any test mock a system boundary?"**

If yes, the mock encodes invisible assumptions about the other side. For each mocked boundary:
1. **Name the assumption** — what does the mock claim the real system does?
2. **Verify it** — write one test that uses the real system to confirm the assumption.
3. **If you can't verify it** — write a contract test.

**Rule of thumb**: If your test mocks something, you need another test that doesn't.

## Checklist Per Cycle

```
[ ] Test describes behavior, not implementation
[ ] Test uses public interface only
[ ] Test would survive internal refactor
[ ] Code is minimal for this test
[ ] No speculative features added
```
@@ -0,0 +1,33 @@
# Deep Modules

From "A Philosophy of Software Design":

**Deep module** = small interface + lots of implementation

```
┌─────────────────────┐
│   Small Interface   │ ← Few methods, simple params
├─────────────────────┤
│                     │
│                     │
│ Deep Implementation │ ← Complex logic hidden
│                     │
│                     │
└─────────────────────┘
```

**Shallow module** = large interface + little implementation (avoid)

```
┌─────────────────────────────────┐
│         Large Interface         │ ← Many methods, complex params
├─────────────────────────────────┤
│       Thin Implementation       │ ← Just passes through
└─────────────────────────────────┘
```

When designing interfaces, ask:

- Can I reduce the number of methods?
- Can I simplify the parameters?
- Can I hide more complexity inside?
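As a sketch of the idea (a hypothetical example, not from the book): a rate limiter whose entire public surface is one method, with the token-bucket bookkeeping hidden inside.

```typescript
// Deep module: one method with simple params; the token-bucket state,
// refill math, and clock handling are all hidden behind it.
class RateLimiter {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,
    private readonly refillPerSec: number,
    // Injecting the clock keeps the time boundary controllable in tests.
    private readonly now: () => number = () => Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = this.now();
  }

  // The entire interface: "may this request proceed?"
  allow(): boolean {
    const elapsedSec = (this.now() - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = this.now();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Callers never see tokens, timestamps, or refill rates; shrinking the interface is what makes the module deep.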
@@ -0,0 +1,31 @@
# Interface Design for Testability

Good interfaces make testing natural:

1. **Accept dependencies, don't create them**

```text
// Testable
function processOrder(order, paymentGateway) {}

// Hard to test
function processOrder(order) {
  const gateway = new StripeGateway();
}
```

2. **Return results, don't produce side effects**

```text
// Testable
function calculateDiscount(cart) -> discount

// Hard to test
function applyDiscount(cart) {
  cart.total -= discount;
}
```

3. **Small surface area**
   - Fewer methods = fewer tests needed
   - Fewer params = simpler test setup
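The first principle translates directly to TypeScript. This is a sketch: `PaymentGateway`, `processOrder`, and the fake gateways are hypothetical names used for illustration.

```typescript
// The dependency is accepted, not created, so a test can pass a fake.
interface PaymentGateway {
  charge(amountCents: number): { ok: boolean };
}

function processOrder(
  order: { totalCents: number },
  gateway: PaymentGateway,
): "confirmed" | "failed" {
  return gateway.charge(order.totalCents).ok ? "confirmed" : "failed";
}

// Test setup becomes trivial: hand-rolled fakes, no mocking framework.
const approveAll: PaymentGateway = { charge: () => ({ ok: true }) };
const declineAll: PaymentGateway = { charge: () => ({ ok: false }) };
```

Both success and failure paths are now reachable from a test without touching a real payment provider.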
@@ -0,0 +1,86 @@
# When to Mock

Mock at **system boundaries** only:

- External APIs (payment, email, etc.)
- Databases (sometimes - prefer test DB)
- Time/randomness
- File system (sometimes)

Don't mock:

- Your own classes/modules
- Internal collaborators
- Anything you control

## Designing for Mockability

At system boundaries, design interfaces that are easy to mock:

**1. Use dependency injection**

Pass external dependencies in rather than creating them internally:

```text
// Easy to mock
function processPayment(order, paymentClient) {
  return paymentClient.charge(order.total);
}

// Hard to mock
function processPayment(order) {
  const client = new StripeClient(process.env.STRIPE_KEY);
  return client.charge(order.total);
}
```

**2. Prefer SDK-style interfaces over generic fetchers**

Create specific functions for each external operation instead of one generic function with conditional logic:

```text
// GOOD: Each function is independently mockable
api = {
  getUser: (id) => fetch(`/users/${id}`),
  getOrders: (userId) => fetch(`/users/${userId}/orders`),
  createOrder: (data) => fetch('/orders', { method: 'POST', body: data }),
}

// BAD: Mocking requires conditional logic inside the mock
api = {
  fetch: (endpoint, options) => fetch(endpoint, options),
}
```

The SDK approach means:
- Each mock returns one specific shape
- No conditional logic in test setup
- Easier to see which endpoints a test exercises
- Stronger contracts per endpoint (typed or documented)
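In TypeScript, the SDK-style shape and its per-endpoint mock might look like this (the `Api` interface and `greet` helper are hypothetical, for illustration only):

```typescript
// SDK-style boundary: one named function per external operation.
interface Api {
  getUser(id: string): Promise<{ id: string; name: string }>;
  createOrder(data: { item: string }): Promise<{ orderId: string }>;
}

// In tests, each mock returns one specific shape; no conditional routing
// on endpoint strings, and the compiler checks the mock against the contract.
const mockApi: Api = {
  getUser: async (id) => ({ id, name: "Alice" }),
  createOrder: async () => ({ orderId: "order-1" }),
};

// Code under test depends on the interface, not on fetch.
async function greet(api: Api, userId: string): Promise<string> {
  const user = await api.getUser(userId);
  return `Hello, ${user.name}`;
}
```

Because `mockApi` must satisfy `Api`, a drifting endpoint signature fails the build instead of silently passing tests.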

## The Boundary Rule

Your system has boundaries — edges where your code talks to something you don't control. Mock **at** those edges, never inside.

Examples of boundaries (mock these):
- Network calls (HTTP, WebSocket, gRPC) from client to server
- External service SDKs (payment, email, auth providers)
- Databases when a test DB isn't practical (prefer real test DBs)
- Time, randomness, filesystem

Examples of internals (never mock these):
- Your own stores, reducers, or state management
- Your own hooks, composables, or services
- Your own utility/helper modules
- Internal class collaborators

**If you're mocking something you wrote, you're testing implementation, not behavior.** The test will break on refactor and catch nothing real.

## Mock Fidelity

Mocks lie. Every mock encodes assumptions about external behavior. When those assumptions are wrong, tests pass but the app breaks.

**After writing tests with mocks, verify mock fidelity:**
- Does your mock return the same data shapes the real system returns?
- Does your mock follow the same timing/ordering the real system follows?
- Write at least one test against the real system (or realistic test double) per mocked boundary.
@@ -0,0 +1,10 @@
# Refactor Candidates

After a TDD cycle, look for:

- **Duplication** → Extract function/class
- **Long methods** → Break into private helpers (keep tests on the public interface)
- **Shallow modules** → Combine or deepen
- **Feature envy** → Move logic to where the data lives
- **Primitive obsession** → Introduce value objects
- **Existing code** the new code reveals as problematic
@@ -0,0 +1,98 @@
# Good and Bad Tests

## Good Tests

**Integration-style**: Test through real interfaces, not mocks of internal parts.

```text
// GOOD: Tests observable behavior
test "user can checkout with valid cart":
  cart = create_cart()
  cart.add(product)
  result = checkout(cart, payment_method)
  assert result.status == "confirmed"
```

Characteristics:

- Tests behavior users/callers care about
- Uses public API only
- Survives internal refactors
- Describes WHAT, not HOW
- One logical assertion per test

## Bad Tests

**Implementation-detail tests**: Coupled to internal structure.

```text
// BAD: Tests implementation details
test "checkout calls payment service internals":
  payment_spy = spy(payment_service)
  checkout(cart, payment_spy)
  assert payment_spy.process_called_with(cart.total)
```

Red flags:

- Mocking internal collaborators
- Testing private methods
- Asserting on call counts/order
- Test breaks when refactoring without behavior change
- Test name describes HOW, not WHAT
- Verifying through external means instead of the interface

```text
// BAD: Bypasses interface to verify
test "create_user saves to database":
  create_user(name="Alice")
  row = db.query("SELECT * FROM users WHERE name = ?", ["Alice"])
  assert row is not null

// GOOD: Verifies through interface
test "create_user makes user retrievable":
  user = create_user(name="Alice")
  retrieved = get_user(user.id)
  assert retrieved.name == "Alice"
```

## What NOT to Test

**Test at system boundaries, not internal modules.** A system has two boundaries:

1. **Server/backend boundary**: Test through the real runtime or framework test harness (HTTP requests, WebSocket messages, queue handlers). Exercise real storage, real state, real protocol handling.
2. **Client/frontend boundary**: Test at the route/page level with external dependencies mocked at the edge (e.g., mock the network layer, not your own stores or hooks).

Tests at these two levels cover your internal modules (stores, hooks, services, helpers) transitively. If a store has a bug, a route-level test that exercises the store's behavior will catch it.

**Do not write separate tests for:**

- **State management** (stores, reducers, state machines) — covered by route/page tests that trigger the same state transitions through user interactions
- **Custom hooks / composables** — covered by route/page tests that use the hook through a real component
- **Individual UI components** — covered by route/page tests that render the full page including those components
- **Config files** (CI workflows, bundler config, deploy config) — not behavioral; the test breaks when the config format changes and catches nothing useful
- **Design tokens / CSS classes** — testing class name presence doesn't verify visual fidelity; either trust the design system or use visual regression tools

**Do write separate unit tests for:**

- **Pure algorithmic functions** where the math matters (rounding, scoring, splitting, fuzzy matching, validation logic). These have complex edge cases that are cheaper to test in isolation.

```text
// BAD: Testing internal state management separately
test "store updates count on increment message":
  store = create_store()
  store.handle_message({ type: "increment" })
  assert store.state.count == 1

// GOOD: Testing the same behavior through the UI boundary
test "user sees updated count after server sends increment":
  render(CounterPage, { websocket: mock_ws })
  mock_ws.receive({ type: "increment" })
  assert screen.has_text("Count: 1")

// GOOD: Pure algorithm deserves its own unit test
test "proportional split rounds to exact total using largest-remainder":
  result = split_proportional(total=100, weights=[1, 1, 1])
  assert sum(result) == 100
  assert result == [34, 33, 33]
```
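
The `split_proportional` pseudocode is exactly the kind of pure function worth unit testing in isolation. A TypeScript sketch of the largest-remainder method (the name mirrors the pseudocode above and is otherwise hypothetical):

```typescript
// Split `total` across `weights` proportionally, rounding so the parts sum
// exactly to `total`: floor every share, then hand the leftover units to
// the shares with the largest fractional remainders (largest-remainder method).
function splitProportional(total: number, weights: number[]): number[] {
  const weightSum = weights.reduce((a, b) => a + b, 0);
  const exact = weights.map((w) => (total * w) / weightSum);
  const floored = exact.map(Math.floor);
  let leftover = total - floored.reduce((a, b) => a + b, 0);

  // Indices sorted by descending fractional remainder; ties keep input order.
  const order = exact
    .map((x, i) => ({ i, frac: x - Math.floor(x) }))
    .sort((a, b) => b.frac - a.frac || a.i - b.i)
    .map((e) => e.i);

  for (const i of order) {
    if (leftover === 0) break;
    floored[i] += 1;
    leftover -= 1;
  }
  return floored;
}
```

For `total=100, weights=[1, 1, 1]` the floors are `[33, 33, 33]` with one unit left over; the tie goes to the first index, giving `[34, 33, 33]` as in the pseudocode test. The edge cases (ties, leftovers, uneven weights) are what make the isolated unit test worthwhile.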