joycraft 0.5.5 → 0.5.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -72,6 +72,8 @@ Joycraft auto-detects your tech stack and creates:
72
72
  - `/joycraft-interview` Lightweight brainstorm. Yap about ideas, get a structured summary
73
73
  - `/joycraft-decompose` Break a brief into small, testable specs
74
74
  - `/joycraft-add-fact` Capture project knowledge on the fly -- routes to the right context doc
75
+ - `/joycraft-lockdown` Generate constrained execution boundaries (read-only tests, deny patterns)
76
+ - `/joycraft-verify` Spawn a separate subagent to independently verify implementation against spec
75
77
  - `/joycraft-session-end` Capture discoveries, verify, commit, push
76
78
  - `/joycraft-implement-level5` Set up Level 5 (autofix loop, holdout scenarios, scenario evolution)
77
79
  - **docs/** structure: `briefs/`, `specs/`, `discoveries/`, `contracts/`, `decisions/`, `context/`
@@ -96,6 +98,8 @@ After init, open Claude Code and use the installed skills:
96
98
  /joycraft-new-feature # Interview → Feature Brief → Atomic Specs → ready to execute
97
99
  /joycraft-decompose # Break any feature into small, independent specs
98
100
  /joycraft-add-fact # Capture a fact mid-session -- auto-routes to the right context doc
101
+ /joycraft-lockdown # Generate constrained execution boundaries for autonomous sessions
102
+ /joycraft-verify # Independent verification -- spawns a subagent to check your work
99
103
  /joycraft-session-end # Wrap up: discoveries, verification, commit, push
100
104
  /joycraft-implement-level5 # Set up Level 5 (autofix, holdout scenarios, evolution)
101
105
  ```
@@ -544,6 +548,65 @@ One question: **how autonomous should git be?**
544
548
 
545
549
  Either way, Joycraft generates explicit git boundaries in your CLAUDE.md: commit message format (`verb: message`), specific file staging (no `git add -A`), no secrets in commits, no force-pushing.
546
550
 
551
+ ## Test-First Development
552
+
553
+ Joycraft enforces a test-first workflow because tests are the mechanism to autonomy. Without tests, your agent implements 9 specs and you have to manually verify each one. With tests, the agent knows when it's done and you can trust the output.
554
+
555
+ ### How it works
556
+
557
+ When you run `/joycraft-new-feature`, the interview now includes test-focused questions: what test types your project uses, how fast your tests need to run for iteration, and whether you want lockdown mode. Every atomic spec generated by `/joycraft-decompose` includes a **Test Plan** that maps each acceptance criterion to at least one test.
558
+
559
+ The execution order is enforced:
560
+
561
+ 1. **Write failing tests first** -- the agent writes tests from the spec's Test Plan
562
+ 2. **Run them and confirm they fail** -- if they pass immediately, something is wrong (you're testing the wrong thing)
563
+ 3. **Implement until tests pass** -- the tests are the contract
564
+
565
+ ### The three laws of test harnesses
566
+
567
+ These are baked into every spec template, discovered through real autonomous development:
568
+
569
+ 1. **Tests must fail first.** If you never watch a test fail, you can't trust what it proves -- the agent will happily write tests that pass trivially, exercising the library instead of your function.
570
+ 2. **Tests must run against your actual function.** Not a reimplementation, not a mock, not the wrapped library. The test calls your code.
571
+ 3. **Tests must detect individual changes.** You need fast smoke tests (seconds, not minutes) so you know if a single change helped or hurt.
572
+
573
+ ### Lockdown mode
574
+
575
+ For complex stacks or long autonomous sessions, `/joycraft-lockdown` generates constrained execution boundaries:
576
+
577
+ - **NEVER rules** for editing test files (read-only)
578
+ - **Deny patterns** for package installs, network access, log reading
579
+ - **Permission mode recommendations** (see below)
580
+
581
+ This prevents the agent from going rogue -- downloading SDKs, pinging random IPs, clearing test files, or filling context with log output. Lockdown is optional and most useful for complex tech stacks (hardware, firmware, multi-device workflows).
582
+
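As a rough illustration, those boundaries might land in `.claude/settings.json` as deny rules like the following. The exact patterns are hypothetical -- the skill derives the real ones from your stack -- and they use Claude Code's `Tool(specifier)` permission-rule syntax:

```javascript
// Hypothetical lockdown output -- these globs are illustrative, not what
// /joycraft-lockdown literally emits for your project.
const denyPatterns = [
  "Edit(tests/**)",       // NEVER rule: test files are read-only
  "Bash(npm install:*)",  // no package installs mid-session
  "Bash(curl:*)",         // no network access
  "Bash(wget:*)",
  "Read(logs/**)",        // keep noisy log output out of the context window
];

// Shape of the settings fragment, merged under permissions.deny:
console.log(JSON.stringify({ permissions: { deny: denyPatterns } }, null, 2));
```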
583
+ ### Independent verification
584
+
585
+ `/joycraft-verify` spawns a separate subagent with a clean context window to independently check your implementation against the spec. The verifier reads the acceptance criteria, runs the tests, and produces a structured pass/fail verdict. It cannot edit any code -- read-only plus test execution only.
586
+
587
+ This follows [Anthropic's finding](https://www.anthropic.com/engineering/harness-design-long-running-apps) that "agents reliably skew positive when grading their own work" and that separating the worker from the evaluator consistently outperforms self-evaluation.
588
+
589
+ ## Claude Code Permission Modes
590
+
591
+ You do **not** need `--dangerously-skip-permissions` for autonomous development. Claude Code offers safer alternatives that Joycraft recommends based on your use case:
592
+
593
+ | Your situation | Permission mode | What it does |
594
+ |---|---|---|
595
+ | Interactive development | `acceptEdits` | Auto-approves file edits, prompts for shell commands |
596
+ | Long autonomous session | `auto` | Safety classifier reviews each action, blocks scope escalation |
597
+ | Autonomous spec execution | `dontAsk` + allowlist | Only pre-approved commands run, everything else denied |
598
+ | Planning and exploration | `plan` | Claude can only read and propose, no edits allowed |
599
+
600
+ ### When to use what
601
+
602
+ **`--permission-mode auto`** is the best default for most developers. A background classifier (Sonnet) reviews each action before execution, blocking actions like downloading unexpected packages, accessing unfamiliar infrastructure, or escalating beyond the task scope. It adds minimal latency and catches the exact problems that make autonomous development scary.
603
+
604
+ **`--permission-mode dontAsk`** is for maximum control. You define an explicit allowlist of what the agent can do (write code, run specific test commands) and everything else is silently denied. No prompts, no surprises. This is what Joycraft's `/joycraft-lockdown` skill helps you configure.
605
+
606
+ **`--dangerously-skip-permissions`** should only be used in isolated containers or VMs with no internet access. It bypasses all safety checks and cannot be overridden by subagents.
607
+
608
+ Both `/joycraft-lockdown` and `/joycraft-tune` now recommend the appropriate permission mode based on your project's risk profile.
609
+
547
610
  ## How It Works with AI Agents
548
611
 
549
612
  **Claude Code** reads `CLAUDE.md` automatically and discovers skills in `.claude/skills/`. The behavioral boundaries guide every action. The skills provide structured workflows accessible via `/slash-commands`.
@@ -581,6 +644,10 @@ Joycraft's approach is synthesized from several sources:
581
644
 
582
645
  **Behavioral boundaries.** CLAUDE.md isn't a suggestion box, it's a contract. Joycraft installs a three-tier boundary framework (Always / Ask First / Never) that prevents the most common AI development failures: overwriting user files, skipping tests, pushing without approval, hardcoding secrets. This is [Addy Osmani's](https://addyosmani.com/blog/good-spec/) "boundaries" principle made concrete.
583
646
 
647
+ **Test-first as the mechanism to autonomy.** Tests aren't a nice-to-have, they're the bridge between "agent writes code" and "agent writes *correct* code." Every spec includes a Test Plan mapping acceptance criteria to tests, and the agent must write failing tests before implementing. This follows the three laws of test harnesses discovered through real autonomous development, and aligns with [Anthropic's harness design research](https://www.anthropic.com/engineering/harness-design-long-running-apps), which found that agents reliably skip verification unless explicitly constrained.
648
+
649
+ **Separation of evaluation from implementation.** [Anthropic's research](https://www.anthropic.com/engineering/harness-design-long-running-apps) found that "agents reliably skew positive when grading their own work." Joycraft addresses this at two levels: `/joycraft-verify` spawns a separate subagent with clean context to independently verify against the spec, and Level 5's holdout scenarios provide external evaluation the implementation agent can never see.
650
+
584
651
  **Knowledge capture over session notes.** Most session notes are never re-read. Joycraft's `/joycraft-session-end` skill captures only *discoveries*: assumptions that were wrong, APIs that behaved unexpectedly, decisions made during implementation that aren't in the spec. If nothing surprising happened, you capture nothing. This keeps the signal-to-noise ratio high.
585
652
 
586
653
  **External holdout scenarios.** [StrongDM's Software Factory](https://factory.strongdm.ai/) proved that AI agents will [actively game visible test suites](https://palisaderesearch.org/blog/specification-gaming). Their solution: scenarios that live *outside* the codebase, invisible to the agent during development. Like a holdout set in ML, this prevents overfitting. Joycraft now implements this directly. `init-autofix` sets up the holdout wall, the scenario agent, and the GitHub App integration.
@@ -904,6 +904,28 @@ Based on their answer, use the appropriate git rules in the Behavioral Boundarie
904
904
  - Ask "should I push?" or "should I create a PR?" \u2014 the answer is always yes, just do it
905
905
  \`\`\`
906
906
 
907
+ ### Permission Mode Recommendation
908
+
909
+ After the git autonomy question and before the risk interview, recommend a Claude Code permission mode based on what you've learned so far. Present this guidance:
910
+
911
+ > **What permission mode should you use?**
912
+ >
913
+ > | Your situation | Use | Why |
914
+ > |---|---|---|
915
+ > | Autonomous spec execution | \`--permission-mode dontAsk\` + allowlist | Only pre-approved commands run |
916
+ > | Long session with some trust | \`--permission-mode auto\` | Safety classifier reviews each action |
917
+ > | Interactive development | \`--permission-mode acceptEdits\` | Auto-approves file edits, prompts for commands |
918
+ >
919
+ > You do NOT need \`--dangerously-skip-permissions\`. The modes above provide autonomy with safety.
920
+
921
+ **If the user chose Autonomous git:** Recommend \`auto\` mode as a good default -- it provides autonomy while the safety classifier catches risky operations. Note that \`dontAsk\` is even more autonomous but requires a well-configured allowlist.
922
+
923
+ **If the user chose Cautious git:** Recommend \`auto\` mode -- it matches their preference for safety with less manual intervention than the default.
924
+
925
+ **If the risk interview reveals production databases, live APIs, or billing systems:** Upgrade the recommendation to \`dontAsk\` with a tight allowlist. Explain that \`dontAsk\` with explicit deny patterns is safer than \`auto\` for high-risk environments because it uses a deterministic allowlist rather than a classifier.
926
+
927
+ This is informational only -- do not change the user's permission mode. Just tell them what to use when they launch Claude Code.
928
+
907
929
  ### Risk Interview
908
930
 
909
931
  Before applying upgrades, ask 3-5 targeted questions to capture what's dangerous in this project. Skip this if \`docs/context/production-map.md\` or \`docs/context/dangerous-assumptions.md\` already exist (offer to update instead).
@@ -1300,6 +1322,26 @@ Adjust the content based on the actual interview responses:
1300
1322
  - Only include NEVER rules for directories/files the user specified
1301
1323
  - If the user allowed certain network tools or package managers, exclude those
1302
1324
 
1325
+ ## Recommended Permission Mode
1326
+
1327
+ After generating the boundaries above, also recommend a Claude Code permission mode. Include this section in your output:
1328
+
1329
+ \`\`\`
1330
+ ### Recommended Permission Mode
1331
+
1332
+ You don't need \\\`--dangerously-skip-permissions\\\`. Safer alternatives exist:
1333
+
1334
+ | Your situation | Use | Why |
1335
+ |---|---|---|
1336
+ | Autonomous spec execution | \\\`--permission-mode dontAsk\\\` + allowlist above | Only pre-approved commands run |
1337
+ | Long session with some trust | \\\`--permission-mode auto\\\` | Safety classifier reviews each action |
1338
+ | Interactive development | \\\`--permission-mode acceptEdits\\\` | Auto-approves file edits, prompts for commands |
1339
+
1340
+ **For lockdown mode, we recommend \\\`--permission-mode dontAsk\\\`** combined with the deny patterns above. This gives you full autonomy for allowed operations while blocking everything else -- no classifier overhead, no prompts, and no safety bypass.
1341
+
1342
+ \\\`--dangerously-skip-permissions\\\` disables ALL safety checks. The modes above give you autonomy without removing the guardrails.
1343
+ \`\`\`
1344
+
1303
1345
  ## Step 4: Offer to Apply
1304
1346
 
1305
1347
  If the user asks you to apply the changes:
@@ -1308,6 +1350,149 @@ If the user asks you to apply the changes:
1308
1350
  2. **For settings.json:** Read the existing \`.claude/settings.json\`, show the user what the \`permissions.deny\` array will look like after adding the new patterns. Ask for confirmation before writing.
1309
1351
 
1310
1352
  **Never auto-apply. Always show the exact changes and wait for explicit approval.**
1353
+ `,
1354
+ "joycraft-verify.md": `---
1355
+ name: joycraft-verify
1356
+ description: Spawn an independent verifier subagent to check an implementation against its spec -- read-only, no code edits, structured pass/fail verdict
1357
+ ---
1358
+
1359
+ # Verify Implementation Against Spec
1360
+
1361
+ The user wants independent verification of an implementation. Your job is to find the relevant spec, extract its acceptance criteria and test plan, then spawn a separate verifier subagent that checks each criterion and produces a structured verdict.
1362
+
1363
+ **Why a separate subagent?** Anthropic's research found that agents reliably skew positive when grading their own work. Separating the agent doing the work from the agent judging it consistently outperforms self-evaluation. The verifier gets a clean context window with no implementation bias.
1364
+
1365
+ ## Step 1: Find the Spec
1366
+
1367
+ If the user provided a spec path (e.g., \`/joycraft-verify docs/specs/2026-03-26-add-widget.md\`), use that path directly.
1368
+
1369
+ If no path was provided, scan \`docs/specs/\` for spec files. Pick the most recently modified \`.md\` file in that directory. If \`docs/specs/\` doesn't exist or is empty, tell the user:
1370
+
1371
+ > No specs found in \`docs/specs/\`. Please provide a spec path: \`/joycraft-verify path/to/spec.md\`
1372
+
1373
+ ## Step 2: Read and Parse the Spec
1374
+
1375
+ Read the spec file and extract:
1376
+
1377
+ 1. **Spec name** -- from the H1 title
1378
+ 2. **Acceptance Criteria** -- the checklist under the \`## Acceptance Criteria\` section
1379
+ 3. **Test Plan** -- the table under the \`## Test Plan\` section, including any test commands
1380
+ 4. **Constraints** -- the \`## Constraints\` section if present
1381
+
1382
+ If the spec has no Acceptance Criteria section, tell the user:
1383
+
1384
+ > This spec doesn't have an Acceptance Criteria section. Verification needs criteria to check against. Add acceptance criteria to the spec and try again.
1385
+
1386
+ If the spec has no Test Plan section, note this but proceed -- the verifier can still check criteria by reading code and running any available project tests.
1387
+
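The extraction in points 1-4 can be sketched as a small helper. This is a hypothetical illustration of the parsing the skill asks the agent to perform by reading, not code the skill ships:

```javascript
// Hypothetical helper: pull one "## Heading" section out of a spec file.
function extractSection(specMarkdown, heading) {
  // Capture everything between "## <heading>" and the next "## " (or EOF).
  const re = new RegExp(
    `^## ${heading}\\n([\\s\\S]*?)(?=^## |$(?![\\s\\S]))`,
    "m"
  );
  const match = specMarkdown.match(re);
  return match ? match[1].trim() : null;
}

const spec = [
  "# Add Widget",
  "",
  "## Acceptance Criteria",
  "- [ ] Widget renders",
  "",
  "## Test Plan",
  "| Criterion | Test |",
].join("\n");

console.log(extractSection(spec, "Acceptance Criteria")); // "- [ ] Widget renders"
console.log(extractSection(spec, "Constraints"));         // null -- section absent
```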
1388
+ ## Step 3: Identify Test Commands
1389
+
1390
+ Look for test commands in these locations (in priority order):
1391
+
1392
+ 1. The spec's Test Plan section (look for commands in backticks or "Type" column entries like "unit", "integration", "e2e", "build")
1393
+ 2. The project's CLAUDE.md (look for test/build commands in the Development Workflow section)
1394
+ 3. Common defaults based on the project type:
1395
+ - Node.js: \`npm test\` or \`pnpm test --run\`
1396
+ - Python: \`pytest\`
1397
+ - Rust: \`cargo test\`
1398
+ - Go: \`go test ./...\`
1399
+
1400
+ Build a list of specific commands the verifier should run.
1401
+
1402
+ ## Step 4: Spawn the Verifier Subagent
1403
+
1404
+ Use Claude Code's Agent tool to spawn a subagent with the following prompt. Replace the placeholders with the actual content extracted in Steps 2-3.
1405
+
1406
+ \`\`\`
1407
+ You are a QA verifier. Your job is to independently verify an implementation against its spec. You have NO context about how the implementation was done -- you are checking it fresh.
1408
+
1409
+ RULES -- these are hard constraints, not suggestions:
1410
+ - You may READ any file using the Read tool or cat
1411
+ - You may RUN these specific test/build commands: [TEST_COMMANDS]
1412
+ - You may NOT edit, create, or delete any files
1413
+ - You may NOT run commands that modify state (no git commit, no npm install, no file writes)
1414
+ - You may NOT install packages or access the network
1415
+ - Report what you OBSERVE, not what you expect or hope
1416
+
1417
+ SPEC NAME: [SPEC_NAME]
1418
+
1419
+ ACCEPTANCE CRITERIA:
1420
+ [ACCEPTANCE_CRITERIA]
1421
+
1422
+ TEST PLAN:
1423
+ [TEST_PLAN]
1424
+
1425
+ CONSTRAINTS:
1426
+ [CONSTRAINTS_OR_NONE]
1427
+
1428
+ YOUR TASK:
1429
+ For each acceptance criterion, determine if it PASSES or FAILS based on evidence:
1430
+
1431
+ 1. Run the test commands listed above. Record the output.
1432
+ 2. For each acceptance criterion:
1433
+ a. Check if there is a corresponding test and whether it passes
1434
+ b. If no test exists, read the relevant source files to verify the criterion is met
1435
+ c. If the criterion cannot be verified by reading code or running tests, mark it MANUAL CHECK NEEDED
1436
+ 3. For criteria about build/test passing, actually run the commands and report results.
1437
+
1438
+ OUTPUT FORMAT -- you MUST use this exact format:
1439
+
1440
+ VERIFICATION REPORT
1441
+
1442
+ | # | Criterion | Verdict | Evidence |
1443
+ |---|-----------|---------|----------|
1444
+ | 1 | [criterion text] | PASS/FAIL/MANUAL CHECK NEEDED | [what you observed] |
1445
+ | 2 | [criterion text] | PASS/FAIL/MANUAL CHECK NEEDED | [what you observed] |
1446
+ [continue for all criteria]
1447
+
1448
+ SUMMARY: X/Y criteria passed. [Z failures need attention. / All criteria verified.]
1449
+
1450
+ If any test commands fail to run (missing dependencies, wrong command, etc.), report the error as evidence for a FAIL verdict on the relevant criterion.
1451
+ \`\`\`
1452
+
1453
+ ## Step 5: Format and Present the Verdict
1454
+
1455
+ Take the subagent's response and present it to the user in this format:
1456
+
1457
+ \`\`\`
1458
+ ## Verification Report -- [Spec Name]
1459
+
1460
+ | # | Criterion | Verdict | Evidence |
1461
+ |---|-----------|---------|----------|
1462
+ | 1 | ... | PASS | ... |
1463
+ | 2 | ... | FAIL | ... |
1464
+
1465
+ **Overall: X/Y criteria passed.**
1466
+
1467
+ [If all passed:]
1468
+ All criteria verified. Ready to commit and open a PR.
1469
+
1470
+ [If any failed:]
1471
+ N failures need attention. Review the evidence above and fix before proceeding.
1472
+
1473
+ [If any MANUAL CHECK NEEDED:]
1474
+ N criteria need manual verification -- they can't be checked by reading code or running tests alone.
1475
+ \`\`\`
1476
+
1477
+ ## Step 6: Suggest Next Steps
1478
+
1479
+ Based on the verdict:
1480
+
1481
+ - **All PASS:** Suggest committing and opening a PR, or running \`/joycraft-session-end\` to capture discoveries.
1482
+ - **Some FAIL:** List the failed criteria and suggest the user fix them, then run \`/joycraft-verify\` again.
1483
+ - **MANUAL CHECK NEEDED items:** Explain what needs human eyes and why automation couldn't verify it.
1484
+
1485
+ **Do NOT offer to fix failures yourself.** The verifier reports; the human (or implementation agent in a separate turn) decides what to do. This separation is the whole point.
1486
+
1487
+ ## Edge Cases
1488
+
1489
+ | Scenario | Behavior |
1490
+ |----------|----------|
1491
+ | Spec has no Test Plan | Warn that verification is weaker without a test plan, but proceed by checking criteria through code reading and any available project-level tests |
1492
+ | All tests pass but a criterion is not testable | Mark as MANUAL CHECK NEEDED with explanation |
1493
+ | Subagent can't run tests (missing deps) | Report the error as FAIL evidence |
1494
+ | No specs found and no path given | Tell user to provide a spec path or create a spec first |
1495
+ | Spec status is "Complete" | Still run verification -- "Complete" means the implementer thinks it's done; verification confirms it |
1311
1496
  `
1312
1497
  };
1313
1498
  var TEMPLATES = {
@@ -2676,4 +2861,4 @@ export {
2676
2861
  SKILLS,
2677
2862
  TEMPLATES
2678
2863
  };
2679
- //# sourceMappingURL=chunk-QIYIJ7VR.js.map
2864
+ //# sourceMappingURL=chunk-G342HURJ.js.map