joycraft 0.5.5 → 0.5.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -72,6 +72,8 @@ Joycraft auto-detects your tech stack and creates:
72
72
  - `/joycraft-interview` Lightweight brainstorm. Yap about ideas, get a structured summary
73
73
  - `/joycraft-decompose` Break a brief into small, testable specs
74
74
  - `/joycraft-add-fact` Capture project knowledge on the fly -- routes to the right context doc
75
+ - `/joycraft-lockdown` Generate constrained execution boundaries (read-only tests, deny patterns)
76
+ - `/joycraft-verify` Spawn a separate subagent to independently verify implementation against spec
75
77
  - `/joycraft-session-end` Capture discoveries, verify, commit, push
76
78
  - `/joycraft-implement-level5` Set up Level 5 (autofix loop, holdout scenarios, scenario evolution)
77
79
  - **docs/** structure: `briefs/`, `specs/`, `discoveries/`, `contracts/`, `decisions/`, `context/`
@@ -96,6 +98,8 @@ After init, open Claude Code and use the installed skills:
96
98
  /joycraft-new-feature # Interview → Feature Brief → Atomic Specs → ready to execute
97
99
  /joycraft-decompose # Break any feature into small, independent specs
98
100
  /joycraft-add-fact # Capture a fact mid-session -- auto-routes to the right context doc
101
+ /joycraft-lockdown # Generate constrained execution boundaries for autonomous sessions
102
+ /joycraft-verify # Independent verification -- spawns a subagent to check your work
99
103
  /joycraft-session-end # Wrap up: discoveries, verification, commit, push
100
104
  /joycraft-implement-level5 # Set up Level 5 (autofix, holdout scenarios, evolution)
101
105
  ```
@@ -544,6 +548,65 @@ One question: **how autonomous should git be?**
544
548
 
545
549
  Either way, Joycraft generates explicit git boundaries in your CLAUDE.md: commit message format (`verb: message`), specific file staging (no `git add -A`), no secrets in commits, no force-pushing.
546
550
 
551
+ ## Test-First Development
552
+
553
+ Joycraft enforces a test-first workflow because tests are the mechanism to autonomy. Without tests, your agent implements 9 specs and you have to manually verify each one. With tests, the agent knows when it's done and you can trust the output.
554
+
555
+ ### How it works
556
+
557
+ When you run `/joycraft-new-feature`, the interview now includes test-focused questions: what test types your project uses, how fast your tests need to run for iteration, and whether you want lockdown mode. Every atomic spec generated by `/joycraft-decompose` includes a **Test Plan** that maps each acceptance criterion to at least one test.
558
+
559
+ The execution order is enforced:
560
+
561
+ 1. **Write failing tests first** -- the agent writes tests from the spec's Test Plan
562
+ 2. **Run them and confirm they fail** -- if they pass immediately, something is wrong (you're testing the wrong thing)
563
+ 3. **Implement until tests pass** -- the tests are the contract
564
+
565
+ ### The three laws of test harnesses
566
+
567
+ These are baked into every spec template, discovered through real autonomous development:
568
+
569
+ 1. **Tests must fail first.** If you never watch a test fail, you can't trust what it proves -- the agent will happily write tests that pass trivially, exercising the library instead of your function.
570
+ 2. **Tests must run against your actual function.** Not a reimplementation, not a mock, not the wrapped library. The test calls your code.
571
+ 3. **Tests must detect individual changes.** You need fast smoke tests (seconds, not minutes) so you know if a single change helped or hurt.
572
+
573
+ ### Lockdown mode
574
+
575
+ For complex stacks or long autonomous sessions, `/joycraft-lockdown` generates constrained execution boundaries:
576
+
577
+ - **NEVER rules** for editing test files (read-only)
578
+ - **Deny patterns** for package installs, network access, log reading
579
+ - **Permission mode recommendations** (see below)
580
+
581
+ This prevents the agent from going rogue -- downloading SDKs, pinging random IPs, clearing test files, or filling context with log output. Lockdown is optional and most useful for complex tech stacks (hardware, firmware, multi-device workflows).
582
+
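As a rough illustration, those boundaries might land in `.claude/settings.json` as deny rules like the following. The exact patterns are hypothetical -- the skill derives the real ones from your stack -- and they use Claude Code's `Tool(specifier)` permission-rule syntax:

```javascript
// Hypothetical lockdown output -- these globs are illustrative, not what
// /joycraft-lockdown literally emits for your project.
const denyPatterns = [
  "Edit(tests/**)",       // NEVER rule: test files are read-only
  "Bash(npm install:*)",  // no package installs mid-session
  "Bash(curl:*)",         // no network access
  "Bash(wget:*)",
  "Read(logs/**)",        // keep noisy log output out of the context window
];

// Shape of the settings fragment, merged under permissions.deny:
console.log(JSON.stringify({ permissions: { deny: denyPatterns } }, null, 2));
```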
583
+ ### Independent verification
584
+
585
+ `/joycraft-verify` spawns a separate subagent with a clean context window to independently check your implementation against the spec. The verifier reads the acceptance criteria, runs the tests, and produces a structured pass/fail verdict. It cannot edit any code -- read-only plus test execution only.
586
+
587
+ This follows [Anthropic's finding](https://www.anthropic.com/engineering/harness-design-long-running-apps) that "agents reliably skew positive when grading their own work" and that separating the worker from the evaluator consistently outperforms self-evaluation.
588
+
589
+ ## Claude Code Permission Modes
590
+
591
+ You do **not** need `--dangerously-skip-permissions` for autonomous development. Claude Code offers safer alternatives that Joycraft recommends based on your use case:
592
+
593
+ | Your situation | Permission mode | What it does |
594
+ |---|---|---|
595
+ | Interactive development | `acceptEdits` | Auto-approves file edits, prompts for shell commands |
596
+ | Long autonomous session | `auto` | Safety classifier reviews each action, blocks scope escalation |
597
+ | Autonomous spec execution | `dontAsk` + allowlist | Only pre-approved commands run, everything else denied |
598
+ | Planning and exploration | `plan` | Claude can only read and propose, no edits allowed |
599
+
600
+ ### When to use what
601
+
602
+ **`--permission-mode auto`** is the best default for most developers. A background classifier (Sonnet) reviews each action before execution, blocking actions like downloading unexpected packages, accessing unfamiliar infrastructure, or escalating beyond the task scope. It adds minimal latency and catches the exact problems that make autonomous development scary.
603
+
604
+ **`--permission-mode dontAsk`** is for maximum control. You define an explicit allowlist of what the agent can do (write code, run specific test commands) and everything else is silently denied. No prompts, no surprises. This is what Joycraft's `/joycraft-lockdown` skill helps you configure.
605
+
606
+ **`--dangerously-skip-permissions`** should only be used in isolated containers or VMs with no internet access. It bypasses all safety checks and cannot be overridden by subagents.
607
+
608
+ Both `/joycraft-lockdown` and `/joycraft-tune` now recommend the appropriate permission mode based on your project's risk profile.
609
+
547
610
  ## How It Works with AI Agents
548
611
 
549
612
  **Claude Code** reads `CLAUDE.md` automatically and discovers skills in `.claude/skills/`. The behavioral boundaries guide every action. The skills provide structured workflows accessible via `/slash-commands`.
@@ -581,6 +644,10 @@ Joycraft's approach is synthesized from several sources:
581
644
 
582
645
  **Behavioral boundaries.** CLAUDE.md isn't a suggestion box, it's a contract. Joycraft installs a three-tier boundary framework (Always / Ask First / Never) that prevents the most common AI development failures: overwriting user files, skipping tests, pushing without approval, hardcoding secrets. This is [Addy Osmani's](https://addyosmani.com/blog/good-spec/) "boundaries" principle made concrete.
583
646
 
647
+ **Test-first as the mechanism to autonomy.** Tests aren't a nice-to-have, they're the bridge between "agent writes code" and "agent writes *correct* code." Every spec includes a Test Plan mapping acceptance criteria to tests, and the agent must write failing tests before implementing. This follows the three laws of test harnesses discovered through real autonomous development, and aligns with [Anthropic's harness design research](https://www.anthropic.com/engineering/harness-design-long-running-apps), which found that agents reliably skip verification unless explicitly constrained.
648
+
649
+ **Separation of evaluation from implementation.** [Anthropic's research](https://www.anthropic.com/engineering/harness-design-long-running-apps) found that "agents reliably skew positive when grading their own work." Joycraft addresses this at two levels: `/joycraft-verify` spawns a separate subagent with clean context to independently verify against the spec, and Level 5's holdout scenarios provide external evaluation the implementation agent can never see.
650
+
584
651
  **Knowledge capture over session notes.** Most session notes are never re-read. Joycraft's `/joycraft-session-end` skill captures only *discoveries*: assumptions that were wrong, APIs that behaved unexpectedly, decisions made during implementation that aren't in the spec. If nothing surprising happened, you capture nothing. This keeps the signal-to-noise ratio high.
585
652
 
586
653
  **External holdout scenarios.** [StrongDM's Software Factory](https://factory.strongdm.ai/) proved that AI agents will [actively game visible test suites](https://palisaderesearch.org/blog/specification-gaming). Their solution: scenarios that live *outside* the codebase, invisible to the agent during development. Like a holdout set in ML, this prevents overfitting. Joycraft now implements this directly. `init-autofix` sets up the holdout wall, the scenario agent, and the GitHub App integration.
@@ -904,6 +904,28 @@ Based on their answer, use the appropriate git rules in the Behavioral Boundarie
904
904
  - Ask "should I push?" or "should I create a PR?" \u2014 the answer is always yes, just do it
905
905
  \`\`\`
906
906
 
907
+ ### Permission Mode Recommendation
908
+
909
+ After the git autonomy question and before the risk interview, recommend a Claude Code permission mode based on what you've learned so far. Present this guidance:
910
+
911
+ > **What permission mode should you use?**
912
+ >
913
+ > | Your situation | Use | Why |
914
+ > |---|---|---|
915
+ > | Autonomous spec execution | \`--permission-mode dontAsk\` + allowlist | Only pre-approved commands run |
916
+ > | Long session with some trust | \`--permission-mode auto\` | Safety classifier reviews each action |
917
+ > | Interactive development | \`--permission-mode acceptEdits\` | Auto-approves file edits, prompts for commands |
918
+ >
919
+ > You do NOT need \`--dangerously-skip-permissions\`. The modes above provide autonomy with safety.
920
+
921
+ **If the user chose Autonomous git:** Recommend \`auto\` mode as a good default -- it provides autonomy while the safety classifier catches risky operations. Note that \`dontAsk\` is even more autonomous but requires a well-configured allowlist.
922
+
923
+ **If the user chose Cautious git:** Recommend \`auto\` mode -- it matches their preference for safety with less manual intervention than the default.
924
+
925
+ **If the risk interview reveals production databases, live APIs, or billing systems:** Upgrade the recommendation to \`dontAsk\` with a tight allowlist. Explain that \`dontAsk\` with explicit deny patterns is safer than \`auto\` for high-risk environments because it uses a deterministic allowlist rather than a classifier.
926
+
927
+ This is informational only -- do not change the user's permission mode. Just tell them what to use when they launch Claude Code.
928
+
907
929
  ### Risk Interview
908
930
 
909
931
  Before applying upgrades, ask 3-5 targeted questions to capture what's dangerous in this project. Skip this if \`docs/context/production-map.md\` or \`docs/context/dangerous-assumptions.md\` already exist (offer to update instead).
@@ -1300,6 +1322,26 @@ Adjust the content based on the actual interview responses:
1300
1322
  - Only include NEVER rules for directories/files the user specified
1301
1323
  - If the user allowed certain network tools or package managers, exclude those
1302
1324
 
1325
+ ## Recommended Permission Mode
1326
+
1327
+ After generating the boundaries above, also recommend a Claude Code permission mode. Include this section in your output:
1328
+
1329
+ \`\`\`
1330
+ ### Recommended Permission Mode
1331
+
1332
+ You don't need \\\`--dangerously-skip-permissions\\\`. Safer alternatives exist:
1333
+
1334
+ | Your situation | Use | Why |
1335
+ |---|---|---|
1336
+ | Autonomous spec execution | \\\`--permission-mode dontAsk\\\` + allowlist above | Only pre-approved commands run |
1337
+ | Long session with some trust | \\\`--permission-mode auto\\\` | Safety classifier reviews each action |
1338
+ | Interactive development | \\\`--permission-mode acceptEdits\\\` | Auto-approves file edits, prompts for commands |
1339
+
1340
+ **For lockdown mode, we recommend \\\`--permission-mode dontAsk\\\`** combined with the deny patterns above. This gives you full autonomy for allowed operations while blocking everything else -- no classifier overhead, no prompts, and no safety bypass.
1341
+
1342
+ \\\`--dangerously-skip-permissions\\\` disables ALL safety checks. The modes above give you autonomy without removing the guardrails.
1343
+ \`\`\`
1344
+
1303
1345
  ## Step 4: Offer to Apply
1304
1346
 
1305
1347
  If the user asks you to apply the changes:
@@ -1308,6 +1350,149 @@ If the user asks you to apply the changes:
1308
1350
  2. **For settings.json:** Read the existing \`.claude/settings.json\`, show the user what the \`permissions.deny\` array will look like after adding the new patterns. Ask for confirmation before writing.
1309
1351
 
1310
1352
  **Never auto-apply. Always show the exact changes and wait for explicit approval.**
1353
+ `,
1354
+ "joycraft-verify.md": `---
1355
+ name: joycraft-verify
1356
+ description: Spawn an independent verifier subagent to check an implementation against its spec -- read-only, no code edits, structured pass/fail verdict
1357
+ ---
1358
+
1359
+ # Verify Implementation Against Spec
1360
+
1361
+ The user wants independent verification of an implementation. Your job is to find the relevant spec, extract its acceptance criteria and test plan, then spawn a separate verifier subagent that checks each criterion and produces a structured verdict.
1362
+
1363
+ **Why a separate subagent?** Anthropic's research found that agents reliably skew positive when grading their own work. Separating the agent doing the work from the agent judging it consistently outperforms self-evaluation. The verifier gets a clean context window with no implementation bias.
1364
+
1365
+ ## Step 1: Find the Spec
1366
+
1367
+ If the user provided a spec path (e.g., \`/joycraft-verify docs/specs/2026-03-26-add-widget.md\`), use that path directly.
1368
+
1369
+ If no path was provided, scan \`docs/specs/\` for spec files. Pick the most recently modified \`.md\` file in that directory. If \`docs/specs/\` doesn't exist or is empty, tell the user:
1370
+
1371
+ > No specs found in \`docs/specs/\`. Please provide a spec path: \`/joycraft-verify path/to/spec.md\`
1372
+
1373
+ ## Step 2: Read and Parse the Spec
1374
+
1375
+ Read the spec file and extract:
1376
+
1377
+ 1. **Spec name** -- from the H1 title
1378
+ 2. **Acceptance Criteria** -- the checklist under the \`## Acceptance Criteria\` section
1379
+ 3. **Test Plan** -- the table under the \`## Test Plan\` section, including any test commands
1380
+ 4. **Constraints** -- the \`## Constraints\` section if present
1381
+
1382
+ If the spec has no Acceptance Criteria section, tell the user:
1383
+
1384
+ > This spec doesn't have an Acceptance Criteria section. Verification needs criteria to check against. Add acceptance criteria to the spec and try again.
1385
+
1386
+ If the spec has no Test Plan section, note this but proceed -- the verifier can still check criteria by reading code and running any available project tests.
1387
+
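The extraction in points 1-4 can be sketched as a small helper. This is a hypothetical illustration of the parsing the skill asks the agent to perform by reading, not code the skill ships:

```javascript
// Hypothetical helper: pull one "## Heading" section out of a spec file.
function extractSection(specMarkdown, heading) {
  // Capture everything between "## <heading>" and the next "## " (or EOF).
  const re = new RegExp(
    `^## ${heading}\\n([\\s\\S]*?)(?=^## |$(?![\\s\\S]))`,
    "m"
  );
  const match = specMarkdown.match(re);
  return match ? match[1].trim() : null;
}

const spec = [
  "# Add Widget",
  "",
  "## Acceptance Criteria",
  "- [ ] Widget renders",
  "",
  "## Test Plan",
  "| Criterion | Test |",
].join("\n");

console.log(extractSection(spec, "Acceptance Criteria")); // "- [ ] Widget renders"
console.log(extractSection(spec, "Constraints"));         // null -- section absent
```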
1388
+ ## Step 3: Identify Test Commands
1389
+
1390
+ Look for test commands in these locations (in priority order):
1391
+
1392
+ 1. The spec's Test Plan section (look for commands in backticks or "Type" column entries like "unit", "integration", "e2e", "build")
1393
+ 2. The project's CLAUDE.md (look for test/build commands in the Development Workflow section)
1394
+ 3. Common defaults based on the project type:
1395
+ - Node.js: \`npm test\` or \`pnpm test --run\`
1396
+ - Python: \`pytest\`
1397
+ - Rust: \`cargo test\`
1398
+ - Go: \`go test ./...\`
1399
+
1400
+ Build a list of specific commands the verifier should run.
1401
+
1402
+ ## Step 4: Spawn the Verifier Subagent
1403
+
1404
+ Use Claude Code's Agent tool to spawn a subagent with the following prompt. Replace the placeholders with the actual content extracted in Steps 2-3.
1405
+
1406
+ \`\`\`
1407
+ You are a QA verifier. Your job is to independently verify an implementation against its spec. You have NO context about how the implementation was done -- you are checking it fresh.
1408
+
1409
+ RULES -- these are hard constraints, not suggestions:
1410
+ - You may READ any file using the Read tool or cat
1411
+ - You may RUN these specific test/build commands: [TEST_COMMANDS]
1412
+ - You may NOT edit, create, or delete any files
1413
+ - You may NOT run commands that modify state (no git commit, no npm install, no file writes)
1414
+ - You may NOT install packages or access the network
1415
+ - Report what you OBSERVE, not what you expect or hope
1416
+
1417
+ SPEC NAME: [SPEC_NAME]
1418
+
1419
+ ACCEPTANCE CRITERIA:
1420
+ [ACCEPTANCE_CRITERIA]
1421
+
1422
+ TEST PLAN:
1423
+ [TEST_PLAN]
1424
+
1425
+ CONSTRAINTS:
1426
+ [CONSTRAINTS_OR_NONE]
1427
+
1428
+ YOUR TASK:
1429
+ For each acceptance criterion, determine if it PASSES or FAILS based on evidence:
1430
+
1431
+ 1. Run the test commands listed above. Record the output.
1432
+ 2. For each acceptance criterion:
1433
+ a. Check if there is a corresponding test and whether it passes
1434
+ b. If no test exists, read the relevant source files to verify the criterion is met
1435
+ c. If the criterion cannot be verified by reading code or running tests, mark it MANUAL CHECK NEEDED
1436
+ 3. For criteria about build/test passing, actually run the commands and report results.
1437
+
1438
+ OUTPUT FORMAT -- you MUST use this exact format:
1439
+
1440
+ VERIFICATION REPORT
1441
+
1442
+ | # | Criterion | Verdict | Evidence |
1443
+ |---|-----------|---------|----------|
1444
+ | 1 | [criterion text] | PASS/FAIL/MANUAL CHECK NEEDED | [what you observed] |
1445
+ | 2 | [criterion text] | PASS/FAIL/MANUAL CHECK NEEDED | [what you observed] |
1446
+ [continue for all criteria]
1447
+
1448
+ SUMMARY: X/Y criteria passed. [Z failures need attention. / All criteria verified.]
1449
+
1450
+ If any test commands fail to run (missing dependencies, wrong command, etc.), report the error as evidence for a FAIL verdict on the relevant criterion.
1451
+ \`\`\`
1452
+
1453
+ ## Step 5: Format and Present the Verdict
1454
+
1455
+ Take the subagent's response and present it to the user in this format:
1456
+
1457
+ \`\`\`
1458
+ ## Verification Report -- [Spec Name]
1459
+
1460
+ | # | Criterion | Verdict | Evidence |
1461
+ |---|-----------|---------|----------|
1462
+ | 1 | ... | PASS | ... |
1463
+ | 2 | ... | FAIL | ... |
1464
+
1465
+ **Overall: X/Y criteria passed.**
1466
+
1467
+ [If all passed:]
1468
+ All criteria verified. Ready to commit and open a PR.
1469
+
1470
+ [If any failed:]
1471
+ N failures need attention. Review the evidence above and fix before proceeding.
1472
+
1473
+ [If any MANUAL CHECK NEEDED:]
1474
+ N criteria need manual verification -- they can't be checked by reading code or running tests alone.
1475
+ \`\`\`
1476
+
1477
+ ## Step 6: Suggest Next Steps
1478
+
1479
+ Based on the verdict:
1480
+
1481
+ - **All PASS:** Suggest committing and opening a PR, or running \`/joycraft-session-end\` to capture discoveries.
1482
+ - **Some FAIL:** List the failed criteria and suggest the user fix them, then run \`/joycraft-verify\` again.
1483
+ - **MANUAL CHECK NEEDED items:** Explain what needs human eyes and why automation couldn't verify it.
1484
+
1485
+ **Do NOT offer to fix failures yourself.** The verifier reports; the human (or implementation agent in a separate turn) decides what to do. This separation is the whole point.
1486
+
1487
+ ## Edge Cases
1488
+
1489
+ | Scenario | Behavior |
1490
+ |----------|----------|
1491
+ | Spec has no Test Plan | Warn that verification is weaker without a test plan, but proceed by checking criteria through code reading and any available project-level tests |
1492
+ | All tests pass but a criterion is not testable | Mark as MANUAL CHECK NEEDED with explanation |
1493
+ | Subagent can't run tests (missing deps) | Report the error as FAIL evidence |
1494
+ | No specs found and no path given | Tell user to provide a spec path or create a spec first |
1495
+ | Spec status is "Complete" | Still run verification -- "Complete" means the implementer thinks it's done; verification confirms it |
1311
1496
  `
1312
1497
  };
1313
1498
  var TEMPLATES = {
@@ -2676,4 +2861,4 @@ export {
2676
2861
  SKILLS,
2677
2862
  TEMPLATES
2678
2863
  };
2679
- //# sourceMappingURL=chunk-QIYIJ7VR.js.map
2864
+ //# sourceMappingURL=chunk-G342HURJ.js.map