npm - bossbuild - Versions diffs - 0.97.0 - Mend

bossbuild 0.97.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (128) hide show

package/LICENSE +21 -0
package/PRINCIPLES.md +70 -0
package/README.md +213 -0
package/VERSION +1 -0
package/bin/boss +3 -0
package/library/README.md +19 -0
package/library/agents/.gitkeep +0 -0
package/library/agents/mentor-venture.md +57 -0
package/library/hooks/.gitkeep +0 -0
package/library/hooks/auto-log.js +133 -0
package/library/hooks/memory-cue.js +82 -0
package/library/hooks/secrets-guard.js +87 -0
package/library/memory-seed/README.md +29 -0
package/library/memory-seed/durable-facts-example.md +16 -0
package/library/practices/.gitkeep +0 -0
package/library/practices/agent-security.md +111 -0
package/library/practices/ai-adoption-culture.md +104 -0
package/library/practices/ai-ux-patterns.md +246 -0
package/library/practices/celebration-of-done.md +100 -0
package/library/practices/conscience-voicing.md +121 -0
package/library/practices/context-discipline.md +116 -0
package/library/practices/design-system.md +152 -0
package/library/practices/git-workflow.md +119 -0
package/library/practices/harm-taxonomy.md +45 -0
package/library/practices/quality-ratchet.md +48 -0
package/library/practices/revalidation.md +57 -0
package/library/practices/scalable-architecture.md +111 -0
package/library/practices/ship-it-live.md +149 -0
package/library/practices/skill-authoring.md +70 -0
package/library/skills/.gitkeep +0 -0
package/library/skills/boss-learn/SKILL.md +63 -0
package/library/skills/boss-sync/SKILL.md +48 -0
package/package.json +49 -0
package/registry/CHANGELOG.md +2737 -0
package/src/board.js +655 -0
package/src/brain.js +288 -0
package/src/cli.js +542 -0
package/src/conscience.js +426 -0
package/src/insights.js +147 -0
package/src/learn.js +92 -0
package/src/map.js +103 -0
package/src/modes.js +82 -0
package/src/paths.js +36 -0
package/src/registry.js +34 -0
package/src/scaffold.js +138 -0
package/src/sync.js +292 -0
package/src/team.js +103 -0
package/stages/L0-quickstart/manifest.json +12 -0
package/stages/L0-quickstart/template/.claude/agents/coder-generalist.md +31 -0
package/stages/L0-quickstart/template/.claude/agents/mentor-venture.md +57 -0
package/stages/L0-quickstart/template/.claude/agents/pm.md +28 -0
package/stages/L0-quickstart/template/.claude/hooks/conscience.js +89 -0
package/stages/L0-quickstart/template/.claude/hooks/lib/loop-runtime.js +507 -0
package/stages/L0-quickstart/template/.claude/hooks/lib/yaml.js +163 -0
package/stages/L0-quickstart/template/.claude/hooks/memory-cue.js +82 -0
package/stages/L0-quickstart/template/.claude/hooks/secrets-guard.js +87 -0
package/stages/L0-quickstart/template/.claude/rules/your-app-code.md +17 -0
package/stages/L0-quickstart/template/.claude/settings.json +36 -0
package/stages/L0-quickstart/template/.claude/skills/boss/SKILL.md +161 -0
package/stages/L0-quickstart/template/.claude/skills/boss-learn/SKILL.md +63 -0
package/stages/L0-quickstart/template/.claude/skills/boss-sync/SKILL.md +55 -0
package/stages/L0-quickstart/template/.claude/skills/canvas/SKILL.md +112 -0
package/stages/L0-quickstart/template/.claude/skills/comprehend/SKILL.md +72 -0
package/stages/L0-quickstart/template/.claude/skills/decide/SKILL.md +122 -0
package/stages/L0-quickstart/template/.claude/skills/feedback/SKILL.md +68 -0
package/stages/L0-quickstart/template/.claude/skills/import/SKILL.md +73 -0
package/stages/L0-quickstart/template/.claude/skills/persona/SKILL.md +92 -0
package/stages/L0-quickstart/template/.claude/skills/prototype/SKILL.md +114 -0
package/stages/L0-quickstart/template/.claude/skills/triage/SKILL.md +104 -0
package/stages/L0-quickstart/template/.claude/skills/welcome/SKILL.md +262 -0
package/stages/L0-quickstart/template/AGENTS.md +31 -0
package/stages/L0-quickstart/template/CLAUDE.md +57 -0
package/stages/L0-quickstart/template/docs/IDS.md +42 -0
package/stages/L0-quickstart/template/docs/ideas/INDEX.md +24 -0
package/stages/L0-quickstart/template/docs/loops/canvas-loop.md +90 -0
package/stages/L0-quickstart/template/docs/loops/capture-loop.md +64 -0
package/stages/L1-mvp/manifest.json +12 -0
package/stages/L1-mvp/template/.claude/agents/mentor-architect.md +124 -0
package/stages/L1-mvp/template/.claude/agents/mentor-cofounder.md +85 -0
package/stages/L1-mvp/template/.claude/agents/mentor-gtm.md +49 -0
package/stages/L1-mvp/template/.claude/agents/program-manager.md +46 -0
package/stages/L1-mvp/template/.claude/agents/tester.md +42 -0
package/stages/L1-mvp/template/.claude/hooks/auto-log.js +133 -0
package/stages/L1-mvp/template/.claude/rules/feature-context.md +18 -0
package/stages/L1-mvp/template/.claude/skills/ai-cost/SKILL.md +249 -0
package/stages/L1-mvp/template/.claude/skills/ai-failure-states/SKILL.md +226 -0
package/stages/L1-mvp/template/.claude/skills/ai-first-init/SKILL.md +227 -0
package/stages/L1-mvp/template/.claude/skills/close/SKILL.md +170 -0
package/stages/L1-mvp/template/.claude/skills/consult/SKILL.md +72 -0
package/stages/L1-mvp/template/.claude/skills/cost-review/SKILL.md +204 -0
package/stages/L1-mvp/template/.claude/skills/design-tokens-init/SKILL.md +192 -0
package/stages/L1-mvp/template/.claude/skills/drift-deep/SKILL.md +170 -0
package/stages/L1-mvp/template/.claude/skills/evals/SKILL.md +154 -0
package/stages/L1-mvp/template/.claude/skills/extract/SKILL.md +209 -0
package/stages/L1-mvp/template/.claude/skills/judge-traces/SKILL.md +68 -0
package/stages/L1-mvp/template/.claude/skills/log/SKILL.md +64 -0
package/stages/L1-mvp/template/.claude/skills/practice/SKILL.md +92 -0
package/stages/L1-mvp/template/.claude/skills/pretotype/SKILL.md +95 -0
package/stages/L1-mvp/template/.claude/skills/red-team/SKILL.md +137 -0
package/stages/L1-mvp/template/.claude/skills/revalidate/SKILL.md +51 -0
package/stages/L1-mvp/template/.claude/skills/ship/SKILL.md +105 -0
package/stages/L1-mvp/template/.claude/skills/smoke/SKILL.md +43 -0
package/stages/L1-mvp/template/.claude/skills/spec/SKILL.md +145 -0
package/stages/L1-mvp/template/claude-append.md +122 -0
package/stages/L1-mvp/template/docs/loops/ai-failure-state-loop.md +107 -0
package/stages/L1-mvp/template/docs/loops/coordination-loop.md +116 -0
package/stages/L1-mvp/template/docs/loops/cost-budget-loop.md +117 -0
package/stages/L1-mvp/template/docs/loops/cost-review-loop.md +113 -0
package/stages/L1-mvp/template/docs/loops/design-tokens-loop.md +98 -0
package/stages/L1-mvp/template/docs/loops/drift-loop.md +149 -0
package/stages/L1-mvp/template/docs/loops/extraction-loop.md +128 -0
package/stages/L1-mvp/template/docs/loops/focus-loop.md +106 -0
package/stages/L1-mvp/template/docs/loops/pretotype-loop.md +88 -0
package/stages/L1-mvp/template/docs/loops/spec-loop.md +83 -0
package/stages/L2-v1/manifest.json +12 -0
package/stages/L2-v1/template/.claude/agents/db-architect.md +91 -0
package/stages/L2-v1/template/.claude/agents/mentor-business.md +124 -0
package/stages/L2-v1/template/.claude/agents/mentor-fundraising.md +72 -0
package/stages/L2-v1/template/.claude/agents/mentor-pitch.md +84 -0
package/stages/L2-v1/template/.claude/agents/mentor-talent.md +84 -0
package/stages/L2-v1/template/.claude/agents/ui-designer.md +81 -0
package/stages/L2-v1/template/.claude/agents/ux-designer.md +87 -0
package/stages/L2-v1/template/.claude/skills/board/SKILL.md +98 -0
package/stages/L2-v1/template/.claude/skills/design-review/SKILL.md +77 -0
package/stages/L2-v1/template/.claude/skills/ux-check/SKILL.md +93 -0
package/stages/L2-v1/template/claude-append.md +59 -0
package/stages/L2-v1/template/docs/loops/design-drift-loop.md +108 -0
package/stages/L3-scale/README.md +13 -0

package/stages/L1-mvp/template/.claude/skills/evals/SKILL.md ADDED Viewed

@@ -0,0 +1,154 @@
+---
+name: evals
+description: Build and run the eval set for an AI-mediated FEAT — "is it correct?" paired with /smoke's "is it alive?" Husain's discipline applied to LLM-mediated control-flow in {{PROJECT_NAME}} — look at your data, build the eval set FIRST, categorize failures by mode, vibes-based eval is only a starting point. Usage - /evals [FEAT-NNN | --new <feat>]
+---
+# /evals — the AI-correctness gate
+`/smoke` answers *is the app alive?* `/evals` answers *is the AI part correct?* In MVP mode, every
+FEAT that puts an LLM in the control flow needs an eval set — even a tiny one. The discipline is
+**Husain's**: *almost all AI quality problems are visible in the data, and almost no one looks.*
+This skill is "look."
+The MVP rule: **20 examples beats 0.** Even an unsophisticated eval set caught vibes earlier than
+no eval set ever could. Categorize failures so the next iteration can target a category, not just
+"make it better."
+## Correctness ≠ safety — the adversarial half
+A clean `/evals` pass means the AI part is *correct* — not that it's *safe*. Safety holds under normal
+use and **degrades under adversarial / jailbreak prompts across every model** (Stanford HAI AI Index
+2026: 362 AI incidents in 2025, up from 233; the industry reports capability benchmarks and almost never
+the responsibility half). So an `/evals` run isn't *done* until **`/red-team`** has run the adversarial
+half — its OWASP-Agentic battery is what that pass should probe. Correctness is `/evals`; resistance
+under attack is `/red-team`; shipping an AI feature needs **both**, not either alone.
+## When to run it
+- A FEAT-NNN involves an LLM call whose output drives behaviour (not just user-facing prose).
+- *Before* you ship the FEAT — eval-set-FIRST, not retrofitted.
+- After a real-user bug report — add the failing case to the eval set, then fix.
+- Regression — after any model swap, prompt change, or upstream library bump, re-run the set.
+## How to run it
+1. **Pick the FEAT.** Read the FEAT spec's *Evals* section. If it doesn't exist (FEAT pre-dates the
+   v0.21.0 spec template), add it: declare an `Eval set path: docs/evals/FEAT-NNN.yml`.
+2. **Create or open** `docs/evals/FEAT-NNN.yml`. Format: a YAML list of cases, each with
+   `id / category / scenario / inputs / expected / why`. Categorize `should-pass` and
+   `should-fail` (the latter sub-categorized by failure mode per Husain's discipline —
+   `over-applies / hallucinates / refuses-wrongly / format-violation / etc.`).
+3. **Run them.** Either:
+   - A runner script (Node, ~150 lines like BOSS's own `docs/architecture/conscience-evals/
+     runner.js`) — preferred, machine-runnable, becomes a CI gate
+   - Or manually walk each case, recording pass/fail and what failed
+4. **Report.** One line per case: `✓ <id>  <scenario>` or `✗ <id>  → <reason>`. Summary table by
+   failure category — the categorized failure count IS the design signal for next iteration.
+5. **Add the failure** when something breaks live. Each real-world bug becomes a case in the set;
+   the set grows from real friction.
+## Eval set format (recommended)
+```yaml
+- id: feat-007-pass-001
+  category: should-pass
+  scenario: <one-line natural description>
+  inputs:
+    <synthetic state / prompt / context>
+  expected:
+    <pass/fail criteria — what "right" looks like>
+  why: <why this case matters>
+- id: feat-007-fail-001
+  category: should-fail
+  failure_mode: hallucination  # one of: garbage | refusal | hallucination | timeout | cost-spike | other
+  scenario: <one-line>
+  inputs: …
+  expected:
+    fires: true  # the detector / guard / refusal catches this
+  why: <what could go wrong if missed>
+```
+Same shape as BOSS's own conscience-evals — copy that runner.js as a starting point if Node fits
+your stack.
+## Failure-state coverage requirement (v0.30.0+ — for AI-mediated FEATs)
+For any FEAT that puts an LLM in the user-visible path (i.e., one that declares responses in
+`docs/ai-failure-states.md`), the eval set **must include at least one `should-fail` case
+per declared failure state**, categorized by `failure_mode` matching the canonical names:
+- `garbage` — schema-invalid / malformed / off-topic output
+- `refusal` — safety-template / over-cautious non-answer / "I can't help with that"
+- `hallucination` — confidently-wrong content (fabricated citations, invented APIs, etc.)
+- `timeout` — call hangs / network drops / 5xx response
+- `cost-spike` — input or output token count exceeds declared per-call cap
+This closes the *"failure-state handlers can be stubs forever"* loophole. The handler stubs
+in `src/` satisfy the `ai-failure-state-loop` predicate; the eval cases here verify the stubs
+*actually do something* when the failure surfaces.
+A handler whose `Eval-tested` field in `docs/ai-failure-states.md` reads `STUB` means the
+founder has committed to writing this eval case — either now, or under a recorded override
+(IDEA-008) with a re-open condition. **STUB without an override is the failure mode this
+upgrade prevents.**
+Cohort-aware: `first-product` may legitimately ship STUB + override on day-one builds
+(*"haven't seen this failure yet; will write the case when FEAT-002 ships"*); `domain-expert`
+in high-stakes domains should not ship STUB without an external escalation route documented.
+## Sharpening (2026 — Hamel Husain / Shreya Shankar)
+The 2026 update to the eval discipline, from the people who teach it. Fold these in:
+- **Error analysis comes first — on *real* traces, not invented cases.** Read your actual
+  session/agent traces, sort the failures into a taxonomy, *then* build evaluators for the modes you
+  actually see. Inventing eval cases before you've looked at real failures is "eval-driven
+  development" done backwards. (If the project runs BOSS's `auto-log` trace substrate, `.boss/trace.jsonl`
+  is exactly this raw material — IDEA-025.) Error analysis is 60–80% of the work.
+- **Binary pass/fail, not 1–5 scores.** A Likert score hides the decision. Force each case to a
+  yes/no — "did it do the thing or not" — and let the *categorized* failures carry the nuance. Scores
+  feel rigorous and measure nothing.
+- **One expert, not a committee.** Pick a single "benevolent dictator" who owns what pass means for
+  this FEAT. Averaging three people's vibes produces mush. (For high-stakes domains, that expert is a
+  domain expert — see the rule below.)
+- **Don't let the model grade its own homework.** If you use LLM-as-judge, the judge must be a
+  *separate* pass from the call that produced the output, with its own examples of the judge being
+  wrong. A right answer reached through a bad/dangerous tool path is still a failure — evaluate the
+  *trajectory*, not just the endpoint. (For a FEAT with a real multi-step tool flow, assert on the
+  tool *sequence*, not only the final output; **UK AISI's Inspect** is the graduation-grade harness to
+  reach for if you outgrow hand-rolled trajectory checks — point at it, don't rebuild it.)
+- **Cost hierarchy.** Cheap deterministic assertions first; reserve LLM-as-judge for the persistent,
+  genuinely-semantic failures. A useful mix to aim for: ~60% deterministic / ~30% LLM-as-judge /
+  ~10% human-in-the-loop.
+- **Reliability ≠ single-shot — measure pass^k.** Non-deterministic output means a case that passes
+  once can fail the next call. Run each load-bearing case **k times and count how often it *all*
+  succeeds** (pass^k, from τ-bench) — that consistency rate, not one green run, is the real reliability
+  signal. Zero-dependency: a loop around the case you already wrote. (τ-bench / UK AISI Inspect are the
+  graduate-grade harnesses to point at when you outgrow it, never a CLI dependency.)
+## Structured outputs (Liu discipline) — strongly recommended
+If the LLM call's output drives subsequent code (control flow, data routing, decisions), **schema
+the output**. Pydantic-first / Zod-first / any-structured-output. Free-form prose is for human
+eyeballs only. The eval set then asserts on the *schema*, not on prose interpretation. This single
+practice eliminates ~80% of LLM-pipeline brittleness (Jason Liu).
+## Rules
+- **Eval-set first.** Write 20+ cases before the LLM call ships, not retrofitted after. If you
+  don't know what 20 cases are, you don't yet know what the FEAT does.
+- **Categorize failures.** Failure-mode categorization is more valuable than success count.
+  Husain: *failure modes are more valuable than success modes — categorize systematically.*
+- **Vibes are a starting point.** First 5 iterations on vibes is fine; the 6th wants rubrics.
+- **Look at your data.** When a case fails, *open the actual model output side-by-side with the
+  prompt*. The fault is almost always visible.
+- **LLM-as-judge is fine but only with examples of the judge being wrong.** Without that,
+  you've added a confidently-wrong layer on top of a confidently-wrong layer.
+- **Domain expertise beats LLM eval** when stakes are real (medical, legal, financial). A
+  doctor reviewing 30 outputs beats GPT reviewing 3,000 in the same domain.
+- **Eval is not CI — yet.** Run on commit when the cost is small. Run on every commit when the
+  cost is justified. Decide deliberately.
+- **Correctness isn't safety.** A green `/evals` pass isn't *done* until `/red-team`'s adversarial pass
+  runs — safety degrades under attack even when correctness holds (AI Index 2026). Two rails, both required.
+- **Non-deterministic ≠ run-once.** Measure consistency across k trials (pass^k), not a single happy-path run.

package/stages/L1-mvp/template/.claude/skills/extract/SKILL.md ADDED Viewed

@@ -0,0 +1,209 @@
+---
+name: extract
+description: Pause and sort patterns — PRINCIPLE #1's discipline as a skill. Reads recent work (git log, devlog, src/, library/) and proposes 1-3 candidate extractions, each routed UP (into BOSS's library/<cat>/ for reuse across projects via /boss-learn) or DOWN (into the app's own core code). Records the decision in docs/extractions/EXTR-NNN-*.md. The LLM-as-judge counterpart to the predicate-based loops — uses Claude's reading, not regex. Usage - /extract
+---
+# /extract — pause and sort the pattern
+> *"BOSS is always scaffolding, but scaffolding is the **motion**, not the goal. At every
+> natural breakpoint — a mode transition, a shipped feature, the third time the same work
+> repeats — **pause and sort the pattern two ways:** UP into BOSS as a reusable superset
+> practice. DOWN into the app as core functionality."* — PRINCIPLE #1.
+The conscience's other moments fire on predicate matches (regex over files). This one needs
+**judgment** — *"is this pattern actually reusable?"* — which a regex can't see. /extract is
+the LLM-as-judge counterpart: Claude reads recent work and proposes specific routes.
+The discipline says **two destinations**, not one:
+- **UP** → `library/<category>/` in the BOSS source repo. Every future project inherits it.
+  Promoted via `boss learn <path> --as <cat>`. *"This pattern is reusable across projects."*
+- **DOWN** → into the app's core code (refactor inline duplication into a named module,
+  function, or schema in `src/`). *"This pattern is product, not scaffold."*
+A pattern can also be **neither yet** — the third honest answer. Recording the *not yet* IS
+the discipline; pretending everything is extractable is the failure mode.
+## When to run it
+- The conscience surfaced the `capture` moment (extraction-loop opened — heuristic says ≥3
+  devlog entries, no extraction recorded yet). The loop's nudge is the prompt; run this skill
+  to act on it.
+- After a FEAT ships (a natural breakpoint per PRINCIPLE #1).
+- After `boss unlock` lands a new mode (another natural breakpoint).
+- The *third time* you find yourself writing similar code or skill prompt or copy-paste-then-
+  edit (the *"three repetitions"* signal).
+- Mid-session when something feels reusable but you can't name where it should live.
+## How to run it
+### 1. Orient on recent work
+Read, silently:
+- Last 10-20 commits (`git log --oneline -20`)
+- The most recent 3-5 devlog entries (`docs/devlog.md` — newest first)
+- The current FEAT (if any) and its acceptance criteria
+- `library/` listing in the BOSS source repo (if accessible) — what's already extracted UP
+- `src/` directory structure — what's accumulating DOWN
+Don't announce these reads. Just orient.
+### 2. Identify 1-3 candidate patterns
+Look for one of three signals (these are the same signals PRINCIPLE #1 names):
+**Signal A — same work repeated.** A skill prompt rewritten three times across FEATs. A
+component-shape copied to three pages. A test fixture duplicated across three test files. A
+helper function reinvented twice.
+**Signal B — a named-and-stable shape.** A pattern you've started referring to by name in
+devlog or commits — *"the cohort-aware framing,"* *"the structured signal,"* *"the
+fallback handler stub."* Names indicate the pattern has earned identity.
+**Signal C — a load-bearing decision.** A choice that other choices depend on — a schema
+shape, a prompt convention, a file-layout pattern, a test harness. Even if it only exists
+once, its *role* is foundational.
+For each candidate, name it in plain language. Don't propose more than 3 — pick the most
+load-bearing ones. **It's fine to find zero** — record that explicitly.
+### 3. Route each candidate (UP / DOWN / NOT-YET)
+For each candidate, ask the routing questions:
+| Question | UP signal | DOWN signal | NOT-YET signal |
+|---|---|---|---|
+| **Can a sibling project reuse this without copy-paste?** | Yes | No (project-specific) | Maybe with rework |
+| **Does this depend on this project's domain?** | No (stack-neutral) | Yes (domain-bound) | Partially |
+| **Has it been used 3+ times in this project?** | Yes (battle-tested) | Yes but won't generalize | No (single use) |
+| **Is the value in the pattern, or in the code?** | The pattern | The code | Unclear |
+| **Where does future-you look for it?** | Across projects (BOSS library) | Within this project (src/) | Not findable yet |
+Route each candidate based on the dominant signal:
+- **UP** → `library/skills/`, `library/agents/`, `library/practices/`, `library/hooks/`, or
+  `library/memory-seed/` in the BOSS source repo. Run `boss learn <path> --as <cat>` to
+  copy + bump VERSION + add CHANGELOG entry. (Requires `$BOSS_DEV` or `npm link`-installed
+  BOSS — see [`/boss-learn`](../boss-learn/SKILL.md).)
+- **DOWN** → refactor the duplication into a named module/function/schema in `src/`. /extract
+  doesn't execute the refactor (the founder owns the code); it names the target file path +
+  the smallest valuable refactor.
+- **NOT-YET** → record the candidate + reason. *"Worth watching; not yet load-bearing enough
+  / not yet generalizable / not yet used outside one feature."*
+### 4. Write `docs/extractions/EXTR-NNN-<slug>.md`
+Use this skeleton. The frontmatter + `Route` line make the file detectable by extraction-loop.
+```markdown
+---
+id: EXTR-NNN
+type: extraction
+owner: pm
+status: recorded
+created: {{DATE}}
+trigger: <devlog-3-entries | FEAT-NNN-shipped | mode-unlock | third-repetition | manual>
+---
+# EXTR-NNN — <one-line summary of the extraction>
+## Recent context
+_What was in the air at the breakpoint — last 3-5 devlog entries summarized in 2-3 sentences,
+last commits referenced. Future-you reads this to remember why this extraction happened._
+## Candidate 1: <name>
+- **What it is:** <one line>
+- **Where it lives now:** <file paths, scope>
+- **Route:** UP | DOWN | NOT-YET
+- **Rationale:** <why this route — answer the routing questions briefly>
+- **If UP:** target `library/<category>/<name>` — run `boss learn <src-path> --as <cat>` next.
+- **If DOWN:** target `src/<path>` — refactor target named + the smallest valuable cut.
+- **If NOT-YET:** re-open condition: <what would change the answer to UP or DOWN>.
+## Candidate 2: <name>
+(same structure)
+## Candidate 3: <name>
+(same structure — omit if fewer than 3 candidates)
+## What didn't make the cut
+_Patterns you considered and explicitly rejected. Naming what's NOT an extraction is half the
+discipline — it prevents over-extraction the next time the loop fires._
+- <pattern> — <why not>
+## Notes
+- Source devlog entries: <date range>
+- Related FEATs: <FEAT-NNN refs>
+- BOSS version when this was recorded: <from VERSION>
+```
+### 5. Update IDS allocation
+Add a row in `docs/ideas/INDEX.md` (or wherever your project tracks IDs) under an *Extractions*
+heading; allocate the next free `EXTR-NNN` integer by grepping `docs/extractions/*.md`.
+### 6. If any candidates are UP — invoke `/boss-learn`
+For each UP-routed candidate, after the founder confirms:
+- Suggest the `boss learn <path> --as <cat>` invocation.
+- Hand off to `/boss-learn` for the two-way router judgment (it may re-route to DOWN if it
+  disagrees — that's the discipline's check on the discipline).
+- Record in the EXTR file whether the UP succeeded (bumped VERSION? CHANGELOG line written?).
+### 7. If any candidates are DOWN — name the refactor
+Don't execute the refactor inside `/extract` (that's `coder-generalist`'s job). Just:
+- Name the target file path and the smallest valuable cut.
+- Add a TODO to `docs/RESUME.md`'s Next-tasks: *"Refactor <name> per EXTR-NNN."*
+- Hand off if the founder wants to start the refactor now.
+### 8. If all are NOT-YET — record honestly
+NOT-YET is a legitimate answer for an extraction record. The `- **Route:** NOT-YET` line
+still closes extraction-loop (the discipline IS the practice of pausing-and-routing, not the
+volume of extractions). Future-you reads this file and sees: *"BOSS made me look; I looked
+honestly; nothing was extractable yet."* That's the principle working.
+## Cohort-aware delivery
+| Cohort | Posture |
+|---|---|
+| `first-product` | Walk through the routing carefully. Define UP / DOWN inline (they may not know what `library/` is). Lean toward NOT-YET for early projects; the discipline of *looking* matters more than the volume of *finding*. |
+| `vibe-coder-newbie` | Show the three signals (A/B/C) with named examples from THIS project. Avoid abstract framings. |
+| `non-tech-founder` | Plain language. UP = "reusable across projects" / DOWN = "make it real in this product" / NOT-YET = "not the right time to extract." They likely own the routing call but may delegate the actual extraction. |
+| `eng-builder` | Terse. They'll spot extractables fast; the question is whether they'll *do* the work. The skill's job is to anchor the decision in the EXTR file, not to teach. |
+| `vibe-virtuoso` | They have a backlog of extractable patterns from past projects. The skill's leverage here is *"which of THESE three from THIS project is the load-bearing one?"* — not the full inventory. |
+| `indie-hacker` | Calm-company framing. UP is investment in the system; DOWN is investment in this product. NOT-YET is the most-used route — patience is the discipline. |
+| `returning-founder` | Skip the routing-question table; they know it. Ask: *"Three sessions in. What did you do twice? What did you almost do a third time?"* They'll name the candidates without prompting. |
+| `domain-expert` | High-stakes: extractions involving regulated logic (PHI handling, financial calculations, legal templates) lean NOT-YET-with-caveats *"this is too domain-specific to generalize; document the project-internal abstraction; do NOT lift to library/."* The default is conservative. |
+## Connection to other loops + skills
+- **Triggered by:** `extraction-loop` (this skill's job is to close it). The loop opens at
+  the heuristic breakpoint; the skill is the judgment.
+- **Routes to:** `/boss-learn` (for UP candidates) — see `library/` rules in the BOSS source
+  repo. `/boss-learn` is the two-destination router at the *promote* step; /extract is the
+  router at the *propose* step. They share PRINCIPLE #1; they sit at different inflection
+  points.
+- **Adjacent:** `/log` (devlog discipline produces the entry signal); `/close` (session-end
+  may surface "consider /extract" when devlog has accumulated entries).
+## What this skill is NOT
+- **Not a one-time ritual.** Re-run after each FEAT, each mode unlock, each *"third time"* signal.
+- **Not an automatic refactor.** The skill names the route; the founder (or `coder-generalist`)
+  owns the actual code change.
+- **Not a quality gate.** Skipping extraction doesn't fail anything. The override grammar
+  applies — record in devlog if you deliberately skip.
+## Rules
+- **NOT-YET is a legitimate answer.** Honest no-pattern-extractable beats inventing one.
+- **Two destinations, not one.** PRINCIPLE #1 is explicit: UP into BOSS or DOWN into app.
+  Treating extraction as one-way (always-UP) is the failure mode this skill exists to prevent.
+- **Three signals, not feelings.** Same-work-repeated / named-and-stable / load-bearing-
+  decision. If you can't tie the candidate to one of the three, it's premature.
+- **Record before route.** Write the EXTR-NNN file FIRST; invoke `/boss-learn` or refactor
+  AFTER. The record IS the discipline.
+- **Cite PRINCIPLE #1.** This skill exists to encode that principle. Naming it in the EXTR
+  file ties the discipline back to its source.
+- **At most three candidates per /extract run.** More than three means you're inventorying,
+  not pausing. Pick the load-bearing ones; let the rest re-surface naturally.

package/stages/L1-mvp/template/.claude/skills/judge-traces/SKILL.md ADDED Viewed

@@ -0,0 +1,68 @@
+---
+name: judge-traces
+description: Error analysis on your real session traces — the Hamel/Shankar discipline applied to your own work. Reads .boss/trace.jsonl (what agents actually did), helps you sort what went wrong into a binary pass/fail failure taxonomy, and routes the real failure modes to /boss-learn. The deliberate, founder-invoked reader for the auto-log trace substrate. Usage - /judge-traces [last N | all]
+---
+# /judge-traces — read your real traces, find the real failure modes
+The 2026 eval discipline (Hamel Husain / Shreya Shankar) is blunt: **error analysis on real traces is
+60–80% of the work, and almost no one looks.** `/judge-traces` is "look" — at *your* sessions, not at
+golden cases someone imagined. It reads the trace your work already leaves and helps you turn it into a
+failure taxonomy you can act on.
+This is the **deliberate reader** for the `auto-log` trace substrate (`library/hooks/auto-log.js`). The
+hook *collects* (passively, when enabled); this skill *judges* (deliberately, when you ask). They're
+kept apart on purpose — collection is never judgment.
+## Prerequisite (and graceful degrade)
+Reads `.boss/trace.jsonl`. If it doesn't exist or is empty, say so plainly and stop — don't invent
+data:
+> *"No trace yet. `auto-log` is dormant by default (a SubagentStop hook has a per-subagent cost). Turn
+> it on in `.claude/settings.json` (see `library/hooks/auto-log.js`) and the trace will accumulate as
+> agents do work. Come back here once there's something to read."*
+Never fabricate a taxonomy from no data. An honest "nothing to judge yet" is the correct output.
+## How to run it
+**1. Read the traces.** Load `.boss/trace.jsonl` (each line: `ts`, `session`, `agent`, `files`,
+`file_count`). Default to the last ~30; `all` reads everything; `last N` reads N.
+**2. Surface the shape, factually first.** Before judging: which agents did what, how often, on which
+files. This is the cheap deterministic pass — counts, not opinions. (Hamel's cost hierarchy: cheap
+assertions before any judgment.)
+**3. Help the founder do error analysis — binary, not scored.** Walk the traces and sort what you can
+see into **pass / fail**, never a 1–5 score. Failure modes to look for in agent work:
+- `wrong-files` — an agent touched files outside its lane (a coder editing docs, a doc agent editing src)
+- `thrash` — the same files churned across many sessions with no shipped outcome
+- `silent-scope-creep` — a small ask that fanned into a large diff
+- `no-trace-of-the-point` — sessions of work with nothing that maps to a named FEAT / risk
+- `<your own>` — the taxonomy is yours; name the modes *you* actually see
+**One expert, not a committee** (Hamel): *you* own what "fail" means for this project. Don't average
+opinions — make the call.
+**4. Don't let the judge grade its own homework.** Where a trace looks like a failure, the read is a
+*separate* pass from whatever produced it — and judge the **trajectory** (did the path make sense),
+not just whether the endpoint happened to be fine. A right outcome via a wrong path is still a finding.
+**5. Route the real modes.** A failure mode that recurs is a candidate for `/boss-learn` — UP (a
+practice/guard for every project) or DOWN (a fix in this app). A one-off is just a one-off; don't
+systematize noise. Name the count: *"`wrong-files` appeared 4× across 3 sessions — worth a guard."*
+## Output
+A short report: the factual shape (agents × files × frequency), then the binary failure taxonomy with
+counts, then the 1–3 modes worth routing to `/boss-learn`. Keep it to what the trace actually shows.
+## Rules
+- **No data → no taxonomy.** Honest "nothing yet" beats a fabricated one.
+- **Binary, not Likert.** Pass/fail forces the decision; scores hide it.
+- **Counts are the signal.** A failure that recurs matters; a one-off doesn't. Lead with frequency.
+- **Local-only.** The trace is this machine's; nothing leaves it. This skill reads, never sends.
+- **Collection ≠ judgment.** `auto-log` collects passively; this judges deliberately. Never fuse them
+  into an always-on auto-grader — that's the trap (the conscience would be grading every session
+  unprompted). Judgment is a thing you choose to do.

package/stages/L1-mvp/template/.claude/skills/log/SKILL.md ADDED Viewed

@@ -0,0 +1,64 @@
+---
+name: log
+description: Append a dated entry to docs/devlog.md — what landed this session, what's next, what surprised you. Lighter than commit messages, denser than CHANGELOG. The thing future-you reads before starting work. Usage - /log <one-line summary or detailed entry>
+---
+# /log — the devlog
+The devlog is the project's working memory between commits and RESUME. Commits answer *what
+changed in the code*; the devlog answers *what happened in the session* — including the decisions
+that didn't show up in a diff: a path tried and abandoned, an assumption updated, a person you
+talked to, a number you saw.
+If you only read one thing when picking the project back up, read the last devlog entry.
+## How to run it
+1. Open (or create) `docs/devlog.md`. If creating, seed with:
+   ```markdown
+   ---
+   id: DEVLOG
+   type: devlog
+   owner: pm
+   status: active
+   ---
+   # Devlog — {{PROJECT_NAME}}
+   Append-only. Newest at the top. Each entry: date, FEAT (if any), what landed, what's next.
+   ```
+2. Append a **new entry at the top** (under the header), dated today:
+   ```markdown
+   ## {{today}}
+   - **FEAT:** FEAT-NNN <name>  _(or "no FEAT — exploration/ops")_
+   - **Landed:** <one or two lines — what's now real that wasn't before>
+   - **Next:** <the very next thing — concrete, one or two lines>
+   - **Surprises / decisions:** <only if there was one — what changed in your model of the problem>
+   ```
+3. If the user gave you a one-liner, that's enough — fill it into **Landed**, leave **Next** empty
+   only if they didn't say. Don't fabricate. Blanks are honest.
+4. If a FEAT closed (acceptance criteria met + smoke green), also flip its status in the FEAT doc
+   and `docs/ideas/INDEX.md` to `shipped`.
+## What belongs in a devlog entry
+- The path you tried that didn't work and what you'd do instead.
+- The number you saw — a latency, a user count, a sign-up rate.
+- A conversation that changed the plan.
+- A decision the diff doesn't capture ("we picked X over Y because…").
+## What doesn't
+- A list of every file you edited. The diff already says that.
+- Commit messages copied in verbatim.
+- Plans for next week. Those go in `docs/RESUME.md` via `/close`.
+## Rules
+- Newest at the top, append-only — never edit old entries. (Wrong? Add a correction at today's entry.)
+- One entry per session, usually. Multiple short entries are fine if the day shifted gears.
+- Short is better than complete. A devlog you'll actually write beats one you won't.

package/stages/L1-mvp/template/.claude/skills/practice/SKILL.md ADDED Viewed

@@ -0,0 +1,92 @@
+---
+name: practice
+description: Capture a craft learning — a better way to build with AI you found — as a shared, attributed PRAC-NNN record your cofounder gets too. The team's commons for staying current on the fast-moving agentic-AI craft, so you both build on each other's discoveries instead of re-learning them. Staleness-aware: AI moves fast, so a practice can carry a review date. Usage - /practice <what you learned>
+---
+# /practice — the shared craft commons
+The craft of building with AI changes monthly — new models, new agent patterns, a cheaper way to do the
+thing you did the expensive way last week. Each founder picks up discoveries on their own; without a
+shared place, they stay trapped in one person's head, and a cofounder re-learns the same lesson the hard
+way (or keeps doing it the outdated, expensive way).
+A **practice** is one such learning, written down once and **shared** with the whole team: *"use streaming
+for the long generations — it cut perceived latency in half"*, *"`claude-haiku-4-5` is plenty for the
+classification step, ~10× cheaper than Opus"*, *"this prompt structure stopped the JSON drift."* The point
+isn't ceremony — it's that **you both get current together and can focus on building**, not on worrying
+whether you're behind or overspending.
+## When to reach for it
+- You found a genuinely better/cheaper/newer way to do something with AI and your cofounder would benefit.
+- You hit a sharp edge (a model quirk, a cost trap, an agent failure mode) and worked out how to handle it.
+- A `/vet` verdict or outside best-practice proved itself in *your* build — capture what actually held.
+Skip it for one-off trivia or anything that belongs in the code as a comment. A practice is a *transferable*
+lesson, not a changelog line.
+## How to run it
+1. **Resolve who learned it** (the credit — and so a teammate knows who to ask):
+   ```bash
+   gh api user --jq '.login' 2>/dev/null || git config user.name
+   ```
+   Use it as `@<login>`; leave `@you` if neither resolves. Never fabricate.
+2. **Pick the next number.** Highest `PRAC-NNN` in `docs/practices/` + 1 (first is `PRAC-001`). Create the
+   directory if needed.
+3. **Write `docs/practices/PRAC-NNN-<slug>.md`:**
+   ```markdown
+   ---
+   id: PRAC-NNN
+   type: practice
+   owner: "@<github-login of whoever learned it>"
+   status: active           # active | stale | retired
+   created: {{today}}
+   applies_to: <what this is about — e.g. "Claude Code" / "model: opus-4.8" / "prompting" / "cost">
+   review_by: <YYYY-MM-DD>   # optional — when to re-check it's still the best way (see staleness below)
+   ---
+   # PRAC-NNN — <the learning, in one line>
+   ## What we learned
+   The practice itself, concretely enough that your cofounder could apply it tomorrow.
+   ## Why it works
+   The reasoning — so it transfers, and so a teammate can tell when it stops applying.
+   ## How to apply
+   The shortest path to using it here. A snippet, a command, a setting — whatever makes it real.
+   ```
+4. **Fill from what the founder gave you.** Don't pad it. A two-line practice that's true beats a page that
+   guesses. Blanks are honest.
+## Staleness — the part that keeps you current (not just documented)
+The AI craft moves fast, so a practice can quietly **go out of date** — a model that was cheapest last
+month isn't, a workaround a new release made unnecessary. Two honest moves:
+- **Set `review_by:`** for anything tied to a specific model, price, or tool version — a date to re-check
+  *"is this still the best way?"* (Don't set it for timeless practices.) When it passes, `/revalidate` the
+  practice the same way you would a paused idea: still true? still the best way? anything changed? → keep /
+  update / retire (flip `status:` to `retired`, or supersede with a new `PRAC`).
+- This is the team's quiet defense against **being outdated or overspending** — BOSS surfaces the question;
+  you don't have to carry the anxiety of tracking every model release yourself.
+## Shared by construction
+`docs/practices/` **commits with the repo** (it's not gitignored), so the moment you push, every practice
+is backed up and your cofounder has it. Attribution (`owner:`) is recognition + a pointer to who to ask —
+not a scoreboard. This is the hive-mind half of the founder layer: you each make the other richer.
+## What this is not
+- Not a style guide or the conventions doc — those live in `AGENTS.md`. A practice is a *discovered lesson*,
+  often AI-specific and time-sensitive.
+- Not a place to log every change. If it isn't transferable craft, it doesn't earn a `PRAC`.
+- Not a ranking. The commons measures *what the team knows*, never *who contributed more*.