safeword 0.7.7 → 0.8.1

Files changed (34)
  1. package/dist/{check-3TTR7WPD.js → check-2QCPMURS.js} +3 -3
  2. package/dist/{chunk-V5T3TEEQ.js → chunk-2P7QXQFL.js} +2 -2
  3. package/dist/{chunk-LETSGOTR.js → chunk-OXQIEKC7.js} +12 -7
  4. package/dist/{chunk-LETSGOTR.js.map → chunk-OXQIEKC7.js.map} +1 -1
  5. package/dist/{chunk-4NJCU6Z7.js → chunk-ZFRO5LB5.js} +2 -2
  6. package/dist/cli.js +6 -6
  7. package/dist/{diff-XJFCAA4Q.js → diff-6LJGYHY5.js} +3 -3
  8. package/dist/{reset-WPXUWP6Y.js → reset-VHNADDMA.js} +3 -3
  9. package/dist/{setup-DLS6K6EO.js → setup-QJNVWHTK.js} +3 -3
  10. package/dist/sync-TIBNJXB2.js +9 -0
  11. package/dist/{upgrade-4ESTGNXG.js → upgrade-GZSLDUEF.js} +4 -4
  12. package/package.json +15 -14
  13. package/templates/SAFEWORD.md +2 -4
  14. package/templates/doc-templates/feature-spec-template.md +1 -1
  15. package/templates/doc-templates/task-spec-template.md +1 -1
  16. package/templates/doc-templates/test-definitions-feature.md +1 -1
  17. package/templates/guides/planning-guide.md +431 -0
  18. package/templates/guides/testing-guide.md +439 -0
  19. package/templates/scripts/lint-md.sh +0 -0
  20. package/templates/skills/safeword-systematic-debugger/SKILL.md +1 -1
  21. package/templates/skills/safeword-tdd-enforcer/SKILL.md +31 -10
  22. package/dist/sync-AAG4SP5F.js +0 -9
  23. package/templates/guides/development-workflow.md +0 -627
  24. package/templates/guides/tdd-best-practices.md +0 -624
  25. package/templates/guides/test-definitions-guide.md +0 -343
  26. package/templates/guides/user-story-guide.md +0 -265
  27. /package/dist/{check-3TTR7WPD.js.map → check-2QCPMURS.js.map} +0 -0
  28. /package/dist/{chunk-V5T3TEEQ.js.map → chunk-2P7QXQFL.js.map} +0 -0
  29. /package/dist/{chunk-4NJCU6Z7.js.map → chunk-ZFRO5LB5.js.map} +0 -0
  30. /package/dist/{diff-XJFCAA4Q.js.map → diff-6LJGYHY5.js.map} +0 -0
  31. /package/dist/{reset-WPXUWP6Y.js.map → reset-VHNADDMA.js.map} +0 -0
  32. /package/dist/{setup-DLS6K6EO.js.map → setup-QJNVWHTK.js.map} +0 -0
  33. /package/dist/{sync-AAG4SP5F.js.map → sync-TIBNJXB2.js.map} +0 -0
  34. /package/dist/{upgrade-4ESTGNXG.js.map → upgrade-GZSLDUEF.js.map} +0 -0
@@ -0,0 +1,439 @@
1
+ # Testing Guide
2
+
3
+ Test methodology, TDD workflow, and test type selection.
4
+
5
+ ---
6
+
7
+ ## Test Philosophy
8
+
9
+ **Test what matters** - Focus on user experience and delivered features, not implementation details.
10
+
11
+ **Always test what you build** - Run tests yourself before completion. Don't ask the user to verify.
12
+
13
+ ---
14
+
15
+ ## Test Integrity (CRITICAL)
16
+
17
+ **NEVER modify, skip, or delete tests without explicit human approval.**
18
+
19
+ Tests are the specification. When a test fails, the implementation is wrong—not the test.
20
+
21
+ ### Forbidden Actions (Require Approval)
22
+
23
+ | Action | Why It's Forbidden |
24
+ | ----------------------------------------------- | --------------------------------- |
25
+ | Changing assertions to match broken code | Hides bugs instead of fixing them |
26
+ | Adding `.skip()`, `.only()`, `xit()`, `.todo()` | Makes failures invisible |
27
+ | Deleting tests you can't get passing | Removes coverage for edge cases |
28
+ | Weakening assertions (`toBe` → `toBeTruthy`) | Reduces test precision |
29
+ | Commenting out test code | Same as skipping |
30
+
31
+ ### What To Do Instead
32
+
33
+ 1. **Test fails?** → Fix the implementation, not the test
34
+ 2. **Test seems wrong?** → Explain why and ask before updating
35
+ 3. **Requirements changed?** → Explain the change and ask before updating tests
36
+ 4. **Test is flaky?** → Fix the flakiness (usually async issues), don't skip it
37
+ 5. **Test blocks progress?** → Ask for guidance, don't work around it
38
+
39
+ ---
40
+
41
+ ## Test Speed Hierarchy
42
+
43
+ **Goal:** Catch bugs quickly and cheaply with fast feedback loops.
44
+
45
+ **Rule:** Test with the fastest test type that can catch the bug.
46
+
47
+ ```text
48
+ Unit (milliseconds) ← Pure functions, no I/O
49
+
50
+ Integration (seconds) ← Multiple modules, database, API calls
51
+
52
+ LLM Eval (seconds) ← AI judgment, costs $0.01-0.30 per run
53
+
54
+ E2E (seconds-minutes) ← Full browser, user flows
55
+ ```
56
+
57
+ ---
58
+
59
+ ## When to Use Each Test Type
60
+
61
+ Answer these questions in order. Stop at first match.
62
+
63
+ ```text
64
+ 1. Does this test AI-generated content quality (tone, reasoning, creativity)?
65
+ └─ YES → LLM Evaluation
66
+ └─ NO → Continue to question 2
67
+
68
+ 2. Does this test require a real browser (Playwright/Cypress)?
69
+ └─ YES → E2E test
70
+ Examples: Multi-page navigation, browser-specific behavior, visual regression
71
+ Note: React Testing Library does NOT require a browser - that's integration
72
+ └─ NO → Continue to question 3
73
+
74
+ 3. Does this test interactions between multiple components/services?
75
+ └─ YES → Integration test
76
+ Examples: API + database, React component + state store
77
+ └─ NO → Continue to question 4
78
+
79
+ 4. Does this test a pure function (input → output, no I/O)?
80
+ └─ YES → Unit test
81
+ Examples: Calculations, formatters, validators, pure algorithms
82
+ └─ NO → Re-evaluate: What are you actually testing?
83
+ ```
84
+
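The ordered questions above can be sketched as a small function. This is only an illustration and not part of the package: the type and field names (`TestTraits`, `chooseTestType`, and so on) are hypothetical, and each field stands for the answer to one question.

```typescript
// Hypothetical names; each flag is the answer to one question in the list above.
type TestType = 'llm-eval' | 'e2e' | 'integration' | 'unit';

interface TestTraits {
  judgesAiContentQuality: boolean; // question 1
  needsRealBrowser: boolean; // question 2
  crossesComponents: boolean; // question 3
  isPureFunction: boolean; // question 4
}

// Checks the questions in order and stops at the first match.
function chooseTestType(t: TestTraits): TestType | 're-evaluate' {
  if (t.judgesAiContentQuality) return 'llm-eval';
  if (t.needsRealBrowser) return 'e2e';
  if (t.crossesComponents) return 'integration';
  if (t.isPureFunction) return 'unit';
  return 're-evaluate'; // no match: reconsider what is being tested
}
```

Because the checks run in order, a test that needs a real browser and also crosses components is classified as E2E, which matches the "stop at first match" rule.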
85
+ **Edge cases:**
86
+
87
+ - Non-deterministic functions (Math.random, Date.now) → Unit test with the random source or clock mocked
88
+ - Functions with environment dependencies (process.env) → Integration test
89
+ - Mixed pure + I/O logic → Extract pure logic, unit test it, integration test I/O
90
+
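One way to keep the first edge case at the unit level is to pass the clock and random source in as parameters, so a test can substitute fixed values. The sketch below is an assumption for illustration, not code from the package; `makeCouponCode`, `Clock`, and `Rng` are hypothetical names.

```typescript
// Hypothetical example: a function that depends on time and randomness.
type Clock = () => number; // stands in for Date.now
type Rng = () => number; // stands in for Math.random

const MS_PER_DAY = 86400000;

// Defaults use the real sources; tests can inject fixed ones.
function makeCouponCode(now: Clock = Date.now, rng: Rng = Math.random): string {
  const day = Math.floor(now() / MS_PER_DAY); // whole days since the epoch
  const suffix = Math.floor(rng() * 1000); // 0..999
  return 'C' + day + '-' + suffix;
}

// With fixed stand-ins the result is deterministic.
const code = makeCouponCode(() => MS_PER_DAY * 2, () => 0.5); // "C2-500"
```

The same idea applies to the mixed pure + I/O case: the string-building logic here is pure once its inputs are supplied, so only the thin layer that reads the real clock needs an integration test.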
91
+ ---
92
+
93
+ ## Bug Detection Matrix
94
+
95
+ Which test type catches which bug?
96
+
97
+ | Bug Type | Unit? | Integration? | E2E? | Best Choice |
98
+ | ------------------------------------ | ----- | ------------ | ---- | --------------- |
99
+ | Calculation error | ✅ | ✅ | ✅ | Unit (fastest) |
100
+ | Invalid input handling | ✅ | ✅ | ✅ | Unit (fastest) |
101
+ | Database query returning wrong data | ❌ | ✅ | ✅ | Integration |
102
+ | API endpoint contract violation | ❌ | ✅ | ✅ | Integration |
103
+ | Race condition between services | ❌ | ✅ | ✅ | Integration |
104
+ | State management bug | ❌ | ✅ | ✅ | Integration |
105
+ | React component rendering wrong data | ❌ | ✅ | ✅ | Integration |
106
+ | CSS layout broken | ❌ | ❌ | ✅ | E2E (only) |
107
+ | Multi-page navigation broken | ❌ | ❌ | ✅ | E2E (only) |
108
+ | Browser-specific rendering | ❌ | ❌ | ✅ | E2E (only) |
109
+ | AI prompt quality degradation | ❌ | ❌ | ❌ | LLM Eval (only) |
110
+ | AI reasoning accuracy | ❌ | ❌ | ❌ | LLM Eval (only) |
111
+
112
+ **Key principle:** If multiple test types can catch the bug, choose the fastest one.
113
+
114
+ ---
115
+
116
+ ## TDD Workflow (RED → GREEN → REFACTOR)
117
+
118
+ Write tests BEFORE implementation. Tests define expected behavior, code makes them pass.
119
+
120
+ ### Phase 1: RED (Write Failing Test)
121
+
122
+ 1. Write test based on expected input/output
123
+ 2. **CRITICAL:** Run test and confirm it fails for the right reason
124
+ 3. **DO NOT write any implementation code yet**
125
+ 4. Commit: `test: [behavior]`
126
+
127
+ **Red Flags → STOP:**
128
+
129
+ | Flag | Action |
130
+ | ----------------------- | -------------------------------- |
131
+ | Test passes immediately | Rewrite - you're testing nothing |
132
+ | Syntax error | Fix syntax, not behavior |
133
+ | Wrote implementation | Delete it, return to test |
134
+ | Multiple tests | Pick ONE |
135
+
136
+ ### Phase 2: GREEN (Make Test Pass)
137
+
138
+ 1. Write **minimum** code to make test pass
139
+ 2. Run test - verify it passes
140
+ 3. No extra features (YAGNI)
141
+ 4. Commit: `feat:` or `fix:`
142
+
143
+ **Anti-Pattern: Mock Implementations**
144
+
145
+ LLMs sometimes hardcode values to pass tests. This is not TDD.
146
+
147
+ ```typescript
148
+ // ❌ BAD - Hardcoded to pass test
149
+ function calculateDiscount(amount: number, tier: string): number {
150
+ return 80; // Passes test but isn't real
151
+ }
152
+
153
+ // ✅ GOOD - Actual logic
154
+ function calculateDiscount(amount: number, tier: string): number {
155
+ if (tier === 'VIP') return amount * 0.8;
156
+ return amount;
157
+ }
158
+ ```
159
+
160
+ ### Phase 3: REFACTOR (Clean Up)
161
+
162
+ 1. Tests pass before changes
163
+ 2. Improve code (rename, extract, dedupe)
164
+ 3. Tests pass after changes
165
+ 4. Commit if changed: `refactor: [improvement]`
166
+
167
+ **NOT Allowed:** New behavior, changing assertions, adding tests.
168
+
169
+ ---
170
+
171
+ ## Test Type Examples
172
+
173
+ ### Unit Tests
174
+
175
+ ```typescript
176
+ // ✅ GOOD - Pure function
177
+ it('applies 20% discount for VIP users', () => {
178
+ expect(calculateDiscount(100, 'VIP')).toBe(80);
179
+ });
180
+
181
+ // ❌ BAD - Testing implementation details
182
+ it('calls setState with correct value', () => {
183
+ expect(setState).toHaveBeenCalledWith({ count: 1 });
184
+ });
185
+ ```
186
+
187
+ ### Integration Tests
188
+
189
+ ```typescript
190
+ describe('Agent + State Integration', () => {
191
+ it('updates character state after agent processes action', async () => {
192
+ const agent = new GameAgent();
193
+ await agent.processAction('attack guard');
194
+
195
+ // Read state after the action; a snapshot taken earlier may be stale
196
+ const state = useGameStore.getState();
197
+ expect(state.character.stress).toBeGreaterThan(0);
198
+ expect(state.messages).toHaveLength(2);
199
+ });
200
+ });
201
+ ```
202
+
203
+ ### E2E Tests
204
+
205
+ ```typescript
206
+ test('user creates account and first item', async ({ page }) => {
207
+ await page.goto('/signup');
208
+ await page.fill('[name="email"]', 'test@example.com');
209
+ await page.fill('[name="password"]', 'secure123');
210
+ await page.click('button:has-text("Sign Up")');
211
+ await expect(page).toHaveURL('/dashboard');
212
+
213
+ await page.click('text=New Item');
214
+ await page.fill('[name="title"]', 'My First Item');
215
+ await page.click('text=Save');
216
+ await expect(page.getByText('My First Item')).toBeVisible();
217
+ });
218
+ ```
219
+
220
+ ### LLM Evaluations
221
+
222
+ ```yaml
223
+ - description: 'Infer user intent from casual input'
224
+ vars:
225
+ input: 'I want to order a large pepperoni'
226
+ assert:
227
+ - type: javascript
228
+ value: JSON.parse(output).intent === 'order_pizza'
229
+ - type: llm-rubric
230
+ value: |
231
+ EXCELLENT: Confirms pizza type/size, asks for delivery details
232
+ POOR: Generic response or wrong intent
233
+ ```
234
+
235
+ ---
236
+
237
+ ## E2E Testing with Persistent Dev Servers
238
+
239
+ Isolate persistent dev instances from test instances to avoid port conflicts.
240
+
241
+ **Port Isolation Strategy:**
242
+
243
+ - **Dev instance**: Project's configured port (e.g., 3000) - runs persistently
244
+ - **Test instances**: `devPort + 1000` (e.g., 4000) - managed by Playwright
245
+
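The `devPort + 1000` convention can be written down once so the dev script and the Playwright config do not drift apart. This is a minimal sketch under that assumption; the helper names are hypothetical and not part of the package.

```typescript
// Assumption: the offset of 1000 comes from the strategy above.
const TEST_PORT_OFFSET = 1000;

// Derives the test-instance port from the persistent dev port.
function testPortFor(devPort: number): number {
  return devPort + TEST_PORT_OFFSET;
}

// Builds the base URL a test runner would point at.
function baseUrlFor(devPort: number): string {
  return 'http://localhost:' + testPortFor(devPort);
}
```

For a dev port of 3000, `testPortFor` gives 4000 and `baseUrlFor` gives `http://localhost:4000`, the values used in the configuration that follows.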
246
+ **Playwright Configuration:**
247
+
248
+ ```typescript
249
+ // playwright.config.ts
+ import { defineConfig } from '@playwright/test';
250
+ export default defineConfig({
251
+ webServer: {
252
+ command: 'npm run dev:test',
253
+ port: 4000,
254
+ reuseExistingServer: !process.env.CI,
255
+ timeout: 120000,
256
+ },
257
+ use: {
258
+ baseURL: 'http://localhost:4000',
259
+ },
260
+ });
261
+ ```
262
+
263
+ **Package.json Scripts:**
264
+
265
+ ```json
266
+ {
267
+ "scripts": {
268
+ "dev": "vite --port 3000",
269
+ "dev:test": "vite --port 4000",
270
+ "test:e2e": "playwright test"
271
+ }
272
+ }
273
+ ```
274
+
275
+ **Cleanup:** See `@./.safeword/guides/zombie-process-cleanup.md` for killing zombie servers.
276
+
277
+ ---
278
+
279
+ ## Writing Effective Tests
280
+
281
+ ### AAA Pattern (Arrange-Act-Assert)
282
+
283
+ ```typescript
284
+ it('applies discount to VIP users', () => {
285
+ const user = { tier: 'VIP' },
286
+ cart = { total: 100 }; // Arrange
287
+ const result = applyDiscount(user, cart); // Act
288
+ expect(result.total).toBe(80); // Assert
289
+ });
290
+ ```
291
+
292
+ ### Test Naming
293
+
294
+ Be descriptive and specific:
295
+
296
+ ```typescript
297
+ // ✅ GOOD
298
+ it('returns 401 when API key is missing');
299
+ it('preserves user input after validation error');
300
+
301
+ // ❌ BAD
302
+ it('works correctly');
303
+ it('should call setState');
304
+ ```
305
+
306
+ ### Test Independence
307
+
308
+ Each test should:
309
+
310
+ - Run in any order
311
+ - Not depend on other tests
312
+ - Clean up its own state
313
+ - Use fresh fixtures/data
314
+
315
+ ```typescript
316
+ // ✅ GOOD - Fresh state per test
317
+ beforeEach(() => {
318
+ gameState = createFreshGameState();
319
+ });
320
+
321
+ // ❌ BAD - Shared state
322
+ let sharedUser = createUser();
323
+ it('test A', () => {
324
+ sharedUser.name = 'Alice';
325
+ });
326
+ it('test B', () => {
327
+ expect(sharedUser.name).toBe('Alice'); // Depends on A!
328
+ });
329
+ ```
330
+
331
+ ### Test Data Builders
332
+
333
+ Use builders for complex test data:
334
+
335
+ ```typescript
336
+ function buildCharacter(overrides = {}) {
337
+ return {
338
+ id: 'test-char-1',
339
+ name: 'Cutter',
340
+ playbook: 'Cutter',
341
+ stress: 0,
342
+ ...overrides,
343
+ };
344
+ }
345
+
346
+ it('should increase stress when resisting', () => {
347
+ const character = buildCharacter({ stress: 3 });
348
+ // Test uses character with stress=3
349
+ });
350
+ ```
351
+
352
+ ### Async Testing
353
+
354
+ **NEVER use arbitrary timeouts:**
355
+
356
+ ```typescript
357
+ // ❌ BAD - Arbitrary timeout
358
+ await page.waitForTimeout(3000);
359
+ await sleep(500);
360
+
361
+ // ✅ GOOD - Poll until condition
362
+ await expect.poll(() => getStatus()).toBe('ready');
363
+ await page.waitForSelector('[data-testid="loaded"]');
364
+ await waitFor(() => expect(screen.getByText('Success')).toBeVisible());
365
+ ```
366
+
367
+ ---
368
+
369
+ ## What Not to Test
370
+
371
+ ❌ **Implementation details** - Private methods, CSS classes, internal state
372
+ ❌ **Third-party libraries** - Assume React/Axios work, test YOUR code
373
+ ❌ **Trivial code** - Getters/setters with no logic, pass-through functions
374
+ ❌ **UI copy** - Use regex `/submit/i`, not exact text matching
375
+
376
+ ---
377
+
378
+ ## Coverage Goals
379
+
380
+ - **Unit tests:** 80%+ coverage of pure functions
381
+ - **Integration tests:** All critical paths covered
382
+ - **E2E tests:** All critical multi-page user flows
383
+ - **LLM evals:** All AI features have evaluation scenarios
384
+
385
+ **What are "critical paths"?**
386
+
387
+ - **Always critical:** Authentication, payment/checkout, data loss scenarios
388
+ - **Usually critical:** Core user workflows, primary feature flows
389
+ - **Rarely critical:** UI polish, admin-only features with low usage
390
+ - **Rule of thumb:** If it breaks, would users notice immediately?
391
+
392
+ ---
393
+
394
+ ## LLM Eval Cost Considerations
395
+
396
+ **Cost:** $0.01-0.30 per run depending on prompt size.
397
+
398
+ **Prompt caching can reduce costs by up to 90%** (30 scenarios: $0.30 → $0.03 after the first run).
399
+
400
+ **Cost reduction strategies:**
401
+
402
+ - Cache static content (system prompts, examples, rules)
403
+ - Batch multiple scenarios in one run
404
+ - Run full evals on PR/schedule, not every commit
405
+
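The arithmetic behind the caching figure can be sketched as follows. The flat price of $0.01 per scenario and the flat 90% discount are assumptions chosen to reproduce the numbers quoted above; real pricing depends on prompt size and provider. `evalRunCost` is a hypothetical helper.

```typescript
// Assumptions: a flat per-scenario price and a flat caching discount (0..1).
function evalRunCost(scenarios: number, costPerScenario: number, cachedDiscount: number): number {
  return scenarios * costPerScenario * (1 - cachedDiscount);
}

const firstRun = evalRunCost(30, 0.01, 0); // roughly $0.30, nothing cached yet
const cachedRun = evalRunCost(30, 0.01, 0.9); // roughly $0.03 once static content is cached
```

Values are floating-point dollars, so compare them with a small tolerance rather than strict equality.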
406
+ ---
407
+
408
+ ## CI/CD Integration
409
+
410
+ - Run unit + integration tests on every commit (fast feedback)
411
+ - Run E2E tests on every PR
412
+ - Run LLM evals on a schedule (a weekly run catches regressions without per-commit cost)
413
+
414
+ ---
415
+
416
+ ## Quick Reference
417
+
418
+ | Need to test... | Test type | Technology | Speed | Cost |
419
+ | -------------------- | ----------- | ---------- | ------ | ---------- |
420
+ | Pure function | Unit | Vitest | Fast | Free |
421
+ | Service integration | Integration | Vitest | Medium | Free |
422
+ | Full user flow | E2E | Playwright | Slow | Free |
423
+ | AI reasoning quality | LLM eval | Promptfoo | Slow | $0.01-0.30 |
424
+
425
+ ---
426
+
427
+ ## Project-Specific Testing Documentation
428
+
429
+ **Location:** `tests/SAFEWORD.md` or `tests/AGENTS.md`
430
+
431
+ **What to include:**
432
+
433
+ - Tech stack (Vitest/Jest, Playwright/Cypress, Promptfoo)
434
+ - Test commands (how to run tests, single-file execution)
435
+ - Setup requirements (API keys, build steps, database)
436
+ - File structure and naming conventions
437
+ - Coverage requirements and PR requirements
438
+
439
+ **If not found:** Ask user "Where are the testing docs?"
File without changes
@@ -270,4 +270,4 @@ See: @./.safeword/scripts/bisect-zombie-processes.sh
270
270
 
271
271
  - Process cleanup guide: @./.safeword/guides/zombie-process-cleanup.md
272
272
  - Debug logging style: @./.safeword/guides/code-philosophy.md
273
- - TDD for fix verification: @./.safeword/guides/tdd-best-practices.md
273
+ - TDD for fix verification: @./.safeword/guides/testing-guide.md
@@ -138,15 +138,37 @@ Before starting Phase 1, create or open a work log:
138
138
  - [ ] No hardcoded/mock values
139
139
  - [ ] Committed
140
140
 
141
+ ### Verification Gate
142
+
143
+ **Before claiming GREEN:** Evidence before claims, always.
144
+
145
+ ```text
146
+ ✅ CORRECT ❌ WRONG
147
+ ───────────────────────────────── ─────────────────────────────────
148
+ Run: npm test "Tests should pass now"
149
+ Output: ✓ 34/34 tests pass "I'm confident this works"
150
+ Claim: "All tests pass" "Tests pass" (no output shown)
151
+ ```
152
+
153
+ **The Rule:** If you haven't run the verification command in this response, you cannot claim it passes.
154
+
155
+ | Claim | Requires | Not Sufficient |
156
+ | ---------------- | ----------------------------- | --------------------------- |
157
+ | "Tests pass" | Fresh test output: 0 failures | "should pass", previous run |
158
+ | "Build succeeds" | Build command: exit 0 | "linter passed" |
159
+ | "Bug fixed" | Original symptom test passes | "code changed" |
160
+
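The "Tests pass" row of the table can be read as a simple predicate over recorded evidence. The sketch below is only an illustration of that reading; `TestEvidence` and `mayClaimTestsPass` are hypothetical names, not part of the skill.

```typescript
// Hypothetical record of one verification run.
interface TestEvidence {
  ranInThisResponse: boolean; // the command was actually executed, not recalled
  exitCode: number;
  failures: number;
}

// A claim is allowed only with fresh output that shows zero failures.
function mayClaimTestsPass(e: TestEvidence | undefined): boolean {
  if (!e) return false; // no output shown: no claim
  return e.ranInThisResponse && e.exitCode === 0 && e.failures === 0;
}
```

A previous run (`ranInThisResponse: false`) or missing evidence both return `false`, which corresponds to the "Not Sufficient" column.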
141
161
  **Red Flags → STOP:**
142
162
 
143
- | Flag | Action |
144
- | ------------------- | -------------------------------------- |
145
- | "Just in case" code | Delete it |
146
- | Multiple functions | Delete extras |
147
- | Refactoring | Stop - that's Phase 3 |
148
- | Test still fails | Debug (→ systematic-debugger if stuck) |
149
- | Hardcoded value | Implement real logic (see below) |
163
+ | Flag | Action |
164
+ | --------------------------- | -------------------------------------- |
165
+ | "should", "probably" claims | Run command, show output first |
166
+ | "Done!" before verification | Run command, show output first |
167
+ | "Just in case" code | Delete it |
168
+ | Multiple functions | Delete extras |
169
+ | Refactoring | Stop - that's Phase 3 |
170
+ | Test still fails | Debug (→ systematic-debugger if stuck) |
171
+ | Hardcoded value | Implement real logic (see below) |
150
172
 
151
173
  ### Anti-Pattern: Mock Implementations
152
174
 
@@ -241,6 +263,5 @@ Phase 0: L0 → create minimal spec → Phase 1: no new test (existing tests cov
241
263
 
242
264
  ## Related
243
265
 
244
- - @./.safeword/guides/test-definitions-guide.md
245
- - @./.safeword/guides/tdd-best-practices.md
246
- - @./.safeword/guides/development-workflow.md
266
+ - @./.safeword/guides/planning-guide.md
267
+ - @./.safeword/guides/testing-guide.md
@@ -1,9 +0,0 @@
1
- import {
2
- sync
3
- } from "./chunk-V5T3TEEQ.js";
4
- import "./chunk-LETSGOTR.js";
5
- import "./chunk-ORQHKDT2.js";
6
- export {
7
- sync
8
- };
9
- //# sourceMappingURL=sync-AAG4SP5F.js.map