npm - safeword - Versions diffs - 0.7.7 → 0.8.1 - Mend

safeword 0.7.7 → 0.8.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (34)

package/templates/guides/testing-guide.md ADDED

@@ -0,0 +1,439 @@
+# Testing Guide
+Test methodology, TDD workflow, and test type selection.
+---
+## Test Philosophy
+**Test what matters** - Focus on user experience and delivered features, not implementation details.
+**Always test what you build** - Run tests yourself before completion. Don't ask the user to verify.
+---
+## Test Integrity (CRITICAL)
+**NEVER modify, skip, or delete tests without explicit human approval.**
+Tests are the specification. When a test fails, the implementation is wrong—not the test.
+### Forbidden Actions (Require Approval)
+| Action                                          | Why It's Forbidden                |
+| ----------------------------------------------- | --------------------------------- |
+| Changing assertions to match broken code        | Hides bugs instead of fixing them |
+| Adding `.skip()`, `.only()`, `xit()`, `.todo()` | Makes failures invisible          |
+| Deleting tests you can't get passing            | Removes coverage for edge cases   |
+| Weakening assertions (`toBe` → `toBeTruthy`)    | Reduces test precision            |
+| Commenting out test code                        | Same as skipping                  |
+### What To Do Instead
+1. **Test fails?** → Fix the implementation, not the test
+2. **Test seems wrong?** → Explain why and ask before updating
+3. **Requirements changed?** → Explain the change and ask before updating tests
+4. **Test is flaky?** → Fix the flakiness (usually async issues), don't skip it
+5. **Test blocks progress?** → Ask for guidance, don't work around it
+---
+## Test Speed Hierarchy
+**Goal:** Catch bugs quickly and cheaply with fast feedback loops.
+**Rule:** Test with the fastest test type that can catch the bug.
+```text
+Unit (milliseconds)      ← Pure functions, no I/O
+  ↓
+Integration (seconds)    ← Multiple modules, database, API calls
+  ↓
+LLM Eval (seconds)       ← AI judgment, costs $0.01-0.30 per run
+  ↓
+E2E (seconds-minutes)    ← Full browser, user flows
+```
+---
+## When to Use Each Test Type
+Answer these questions in order. Stop at first match.
+```text
+1. Does this test AI-generated content quality (tone, reasoning, creativity)?
+   └─ YES → LLM Evaluation
+   └─ NO → Continue to question 2
+2. Does this test require a real browser (Playwright/Cypress)?
+   └─ YES → E2E test
+      Examples: Multi-page navigation, browser-specific behavior, visual regression
+      Note: React Testing Library does NOT require a browser - that's integration
+   └─ NO → Continue to question 3
+3. Does this test interactions between multiple components/services?
+   └─ YES → Integration test
+      Examples: API + database, React component + state store
+   └─ NO → Continue to question 4
+4. Does this test a pure function (input → output, no I/O)?
+   └─ YES → Unit test
+      Examples: Calculations, formatters, validators, pure algorithms
+   └─ NO → Re-evaluate: What are you actually testing?
+```
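The question order above can be read as a first-match function. The sketch below is only an illustration: the `TestTraits` shape and the `chooseTestType` name are hypothetical and are not part of safeword.

```typescript
// Hypothetical traits describing what a test exercises (illustrative names).
interface TestTraits {
  judgesAiContentQuality: boolean;
  needsRealBrowser: boolean;
  spansMultipleComponents: boolean;
  isPureFunction: boolean;
}

type TestType = 'llm-eval' | 'e2e' | 'integration' | 'unit' | 're-evaluate';

// Ask the questions in the guide's order and stop at the first match.
function chooseTestType(t: TestTraits): TestType {
  if (t.judgesAiContentQuality) return 'llm-eval';
  if (t.needsRealBrowser) return 'e2e';
  if (t.spansMultipleComponents) return 'integration';
  if (t.isPureFunction) return 'unit';
  return 're-evaluate';
}

// A React component wired to a state store: no browser needed, so integration.
const componentWithStore: TestType = chooseTestType({
  judgesAiContentQuality: false,
  needsRealBrowser: false,
  spansMultipleComponents: true,
  isPureFunction: false,
});
console.log(componentWithStore); // "integration"
```

Because the checks run top to bottom, a test that matches several questions is classified by the earliest one, which mirrors the "stop at first match" rule.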
+**Edge cases:**
+- Non-deterministic functions (Math.random, Date.now) → Unit test with mocked randomness or time
+- Functions with environment dependencies (process.env) → Integration test
+- Mixed pure + I/O logic → Extract pure logic, unit test it, integration test I/O
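One way to keep non-deterministic functions unit-testable is to pass the random source or clock in as a parameter, so a test can substitute a fixed stand-in. This is a minimal sketch under that assumption; `rollDie` and `isExpired` are hypothetical names, not functions from safeword.

```typescript
// Hypothetical example: the random source is a parameter, so tests can replace it.
function rollDie(random: () => number = Math.random): number {
  // random() is expected to return a value in [0, 1)
  return Math.floor(random() * 6) + 1;
}

// Same idea for time: pass the clock in instead of calling Date.now() directly.
function isExpired(expiresAtMs: number, now: () => number = Date.now): boolean {
  return now() >= expiresAtMs;
}

// Deterministic checks with fixed stand-ins
console.log(rollDie(() => 0)); // 1
console.log(rollDie(() => 0.5)); // 4
console.log(isExpired(1000, () => 999)); // false
console.log(isExpired(1000, () => 1000)); // true
```

Production callers use the defaults, while tests supply their own functions, so no global mocking is needed.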
+---
+## Bug Detection Matrix
+Which test type catches which bug?
+| Bug Type                             | Unit? | Integration? | E2E? | Best Choice     |
+| ------------------------------------ | ----- | ------------ | ---- | --------------- |
+| Calculation error                    | ✅    | ✅           | ✅   | Unit (fastest)  |
+| Invalid input handling               | ✅    | ✅           | ✅   | Unit (fastest)  |
+| Database query returning wrong data  | ❌    | ✅           | ✅   | Integration     |
+| API endpoint contract violation      | ❌    | ✅           | ✅   | Integration     |
+| Race condition between services      | ❌    | ✅           | ✅   | Integration     |
+| State management bug                 | ❌    | ✅           | ✅   | Integration     |
+| React component rendering wrong data | ❌    | ✅           | ✅   | Integration     |
+| CSS layout broken                    | ❌    | ❌           | ✅   | E2E (only)      |
+| Multi-page navigation broken         | ❌    | ❌           | ✅   | E2E (only)      |
+| Browser-specific rendering           | ❌    | ❌           | ✅   | E2E (only)      |
+| AI prompt quality degradation        | ❌    | ❌           | ❌   | LLM Eval (only) |
+| AI reasoning accuracy                | ❌    | ❌           | ❌   | LLM Eval (only) |
+**Key principle:** If multiple test types can catch the bug, choose the fastest one.
+---
+## TDD Workflow (RED → GREEN → REFACTOR)
+Write tests BEFORE implementation. Tests define expected behavior; code makes them pass.
+### Phase 1: RED (Write Failing Test)
+1. Write test based on expected input/output
+2. **CRITICAL:** Run test and confirm it fails for the right reason
+3. **DO NOT write any implementation code yet**
+4. Commit: `test: [behavior]`
+**Red Flags → STOP:**
+| Flag                    | Action                           |
+| ----------------------- | -------------------------------- |
+| Test passes immediately | Rewrite - you're testing nothing |
+| Syntax error            | Fix syntax, not behavior         |
+| Wrote implementation    | Delete it, return to test        |
+| Multiple tests          | Pick ONE                         |
+### Phase 2: GREEN (Make Test Pass)
+1. Write **minimum** code to make test pass
+2. Run test - verify it passes
+3. No extra features (YAGNI)
+4. Commit: `feat:` or `fix:`
+**Anti-Pattern: Mock Implementations**
+LLMs sometimes hardcode values to pass tests. This is not TDD.
+```typescript
+// ❌ BAD - Hardcoded to pass test
+function calculateDiscount(amount, tier) {
+  return 80; // Passes test but isn't real
+}
+// ✅ GOOD - Actual logic
+function calculateDiscount(amount, tier) {
+  if (tier === 'VIP') return amount * 0.8;
+  return amount;
+}
+```
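A hardcoded return value usually survives only while there is a single expectation. The sketch below, which assumes a hypothetical `standard` tier and a hypothetical `passesAll` helper, shows how a second and third case separate the hardcoded version from real logic.

```typescript
// Hardcoded version from the anti-pattern: returns the value one test expects.
function hardcodedDiscount(_amount: number, _tier: string): number {
  return 80;
}

// Real logic, matching the GOOD example above.
function calculateDiscount(amount: number, tier: string): number {
  if (tier === 'VIP') return amount * 0.8;
  return amount;
}

// Each case is [amount, tier, expected]. The first case alone cannot tell the two apart.
const cases: Array<[number, string, number]> = [
  [100, 'VIP', 80],
  [200, 'VIP', 160],
  [100, 'standard', 100], // 'standard' is an assumed non-VIP tier
];

function passesAll(fn: (amount: number, tier: string) => number): boolean {
  return cases.every(([amount, tier, expected]) => fn(amount, tier) === expected);
}

console.log(passesAll(hardcodedDiscount)); // false
console.log(passesAll(calculateDiscount)); // true
```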
+### Phase 3: REFACTOR (Clean Up)
+1. Tests pass before changes
+2. Improve code (rename, extract, dedupe)
+3. Tests pass after changes
+4. Commit if changed: `refactor: [improvement]`
+**NOT Allowed:** New behavior, changing assertions, adding tests.
+---
+## Test Type Examples
+### Unit Tests
+```typescript
+// ✅ GOOD - Pure function
+it('applies 20% discount for VIP users', () => {
+  expect(calculateDiscount(100, 'VIP')).toBe(80);
+});
+// ❌ BAD - Testing implementation details
+it('calls setState with correct value', () => {
+  expect(setState).toHaveBeenCalledWith({ count: 1 });
+});
+```
+### Integration Tests
+```typescript
+describe('Agent + State Integration', () => {
+  it('updates character state after agent processes action', async () => {
+    const agent = new GameAgent();
+    await agent.processAction('attack guard');
+    // Read state after the action; a snapshot taken earlier may be stale
+    const store = useGameStore.getState();
+    expect(store.character.stress).toBeGreaterThan(0);
+    expect(store.messages).toHaveLength(2);
+  });
+});
+```
+### E2E Tests
+```typescript
+import { test, expect } from '@playwright/test';
+test('user creates account and first item', async ({ page }) => {
+  await page.goto('/signup');
+  await page.fill('[name="email"]', 'test@example.com');
+  await page.fill('[name="password"]', 'secure123');
+  await page.click('button:has-text("Sign Up")');
+  await expect(page).toHaveURL('/dashboard');
+  await page.click('text=New Item');
+  await page.fill('[name="title"]', 'My First Item');
+  await page.click('text=Save');
+  await expect(page.getByText('My First Item')).toBeVisible();
+});
+```
+### LLM Evaluations
+```yaml
+- description: 'Infer user intent from casual input'
+  vars:
+    input: 'I want to order a large pepperoni'
+  assert:
+    - type: javascript
+      value: JSON.parse(output).intent === 'order_pizza'
+    - type: llm-rubric
+      value: |
+        EXCELLENT: Confirms pizza type/size, asks for delivery details
+        POOR: Generic response or wrong intent
+```
+---
+## E2E Testing with Persistent Dev Servers
+Isolate persistent dev instances from test instances to avoid port conflicts.
+**Port Isolation Strategy:**
+- **Dev instance**: Project's configured port (e.g., 3000) - runs persistently
+- **Test instances**: `devPort + 1000` (e.g., 4000) - managed by Playwright
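The `devPort + 1000` rule can be captured in a small helper so the offset lives in one place. This is a sketch only; `testPortFor` and `TEST_PORT_OFFSET` are hypothetical names and the range check is an added assumption.

```typescript
// Hypothetical helper: derive the test port from the dev port using the +1000 offset.
const TEST_PORT_OFFSET = 1000;

function testPortFor(devPort: number): number {
  const port = devPort + TEST_PORT_OFFSET;
  // TCP ports top out at 65535
  if (!Number.isInteger(devPort) || devPort < 1 || port > 65535) {
    throw new RangeError(`Cannot derive a test port from dev port ${devPort}`);
  }
  return port;
}

console.log(testPortFor(3000)); // 4000
console.log(`http://localhost:${testPortFor(3000)}`); // http://localhost:4000
```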
+**Playwright Configuration:**
+```typescript
+// playwright.config.ts
+import { defineConfig } from '@playwright/test';
+export default defineConfig({
+  webServer: {
+    command: 'npm run dev:test',
+    port: 4000,
+    reuseExistingServer: !process.env.CI,
+    timeout: 120000,
+  },
+  use: {
+    baseURL: 'http://localhost:4000',
+  },
+});
+```
+**Package.json Scripts:**
+```json
+{
+  "scripts": {
+    "dev": "vite --port 3000",
+    "dev:test": "vite --port 4000",
+    "test:e2e": "playwright test"
+  }
+}
+```
+**Cleanup:** See `@./.safeword/guides/zombie-process-cleanup.md` for killing zombie servers.
+---
+## Writing Effective Tests
+### AAA Pattern (Arrange-Act-Assert)
+```typescript
+it('applies discount to VIP users', () => {
+  // Arrange
+  const user = { tier: 'VIP' };
+  const cart = { total: 100 };
+  // Act
+  const result = applyDiscount(user, cart);
+  // Assert
+  expect(result.total).toBe(80);
+});
+```
+### Test Naming
+Be descriptive and specific:
+```typescript
+// ✅ GOOD
+it('returns 401 when API key is missing');
+it('preserves user input after validation error');
+// ❌ BAD
+it('works correctly');
+it('should call setState');
+```
+### Test Independence
+Each test should:
+- Run in any order
+- Not depend on other tests
+- Clean up its own state
+- Use fresh fixtures/data
+```typescript
+// ✅ GOOD - Fresh state per test
+beforeEach(() => {
+  gameState = createFreshGameState();
+});
+// ❌ BAD - Shared state
+let sharedUser = createUser();
+it('test A', () => {
+  sharedUser.name = 'Alice';
+});
+it('test B', () => {
+  expect(sharedUser.name).toBe('Alice'); // Depends on A!
+});
+```
+### Test Data Builders
+Use builders for complex test data:
+```typescript
+function buildCharacter(overrides = {}) {
+  return {
+    id: 'test-char-1',
+    name: 'Cutter',
+    playbook: 'Cutter',
+    stress: 0,
+    ...overrides,
+  };
+}
+it('should increase stress when resisting', () => {
+  const character = buildCharacter({ stress: 3 });
+  // Test uses character with stress=3
+});
+```
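In a typed codebase the same builder can constrain its overrides. The sketch below assumes a `Character` interface inferred from the fields shown above; the real project type may differ.

```typescript
// Hypothetical Character shape inferred from the builder's fields above.
interface Character {
  id: string;
  name: string;
  playbook: string;
  stress: number;
}

// Partial<Character> restricts overrides to known fields with the right types.
function buildCharacter(overrides: Partial<Character> = {}): Character {
  return {
    id: 'test-char-1',
    name: 'Cutter',
    playbook: 'Cutter',
    stress: 0,
    ...overrides,
  };
}

const stressed = buildCharacter({ stress: 3 });
console.log(stressed.stress); // 3
console.log(stressed.name); // "Cutter"
```

Each call returns a new object, which also supports the test independence rule: no test mutates data another test reads.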
+### Async Testing
+**NEVER use arbitrary timeouts:**
+```typescript
+// ❌ BAD - Arbitrary timeout
+await page.waitForTimeout(3000);
+await sleep(500);
+// ✅ GOOD - Poll until condition
+await expect.poll(() => getStatus()).toBe('ready');
+await page.waitForSelector('[data-testid="loaded"]');
+await waitFor(() => expect(screen.getByText('Success')).toBeVisible());
+```
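For code outside Playwright or Testing Library, the same "poll until condition" idea can be written by hand. The `pollUntil` helper below is a hypothetical sketch of the general technique, not the implementation used by either library, and `getStatus` here is a stand-in for a real status check.

```typescript
// Hypothetical helper: re-check a condition at short intervals until it holds or time runs out.
async function pollUntil<T>(
  read: () => T | Promise<T>,
  isDone: (value: T) => boolean,
  { timeoutMs = 5000, intervalMs = 50 } = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await read();
    if (isDone(value)) return value; // stop as soon as the condition holds
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Stand-in for a real status check: reports 'ready' on the third read.
let reads = 0;
const getStatus = (): string => (++reads >= 3 ? 'ready' : 'starting');

const statusPromise = pollUntil(getStatus, (status) => status === 'ready', { intervalMs: 10 });
statusPromise.then((status) => console.log(status, reads)); // ready 3
```

Unlike a fixed sleep, the wait ends as soon as the condition is met, and a slow environment gets the full timeout instead of failing at an arbitrary cutoff.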
+---
+## What Not to Test
+❌ **Implementation details** - Private methods, CSS classes, internal state
+❌ **Third-party libraries** - Assume React/Axios work, test YOUR code
+❌ **Trivial code** - Getters/setters with no logic, pass-through functions
+❌ **UI copy** - Use regex `/submit/i`, not exact text matching
+---
+## Coverage Goals
+- **Unit tests:** 80%+ coverage of pure functions
+- **Integration tests:** All critical paths covered
+- **E2E tests:** All critical multi-page user flows
+- **LLM evals:** All AI features have evaluation scenarios
+**What are "critical paths"?**
+- **Always critical:** Authentication, payment/checkout, data loss scenarios
+- **Usually critical:** Core user workflows, primary feature flows
+- **Rarely critical:** UI polish, admin-only features with low usage
+- **Rule of thumb:** If it breaks, would users notice immediately?
+---
+## LLM Eval Cost Considerations
+**Cost:** $0.01-0.30 per run depending on prompt size.
+**Prompt caching can reduce costs by up to 90%** (e.g., 30 scenarios: $0.30 → $0.03 after the first run).
+**Cost reduction strategies:**
+- Cache static content (system prompts, examples, rules)
+- Batch multiple scenarios in one run
+- Run full evals on PR/schedule, not every commit
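The cost figures above can be checked with a little arithmetic. This sketch assumes $0.01 per scenario (the low end of the quoted range) and takes the 90% saving at face value; `evalRunCostCents` is a hypothetical helper, and real pricing depends on the provider and prompt size.

```typescript
// Work in cents to avoid floating-point drift in the dollar amounts.
// The per-scenario price and 90% saving are the guide's illustrative figures.
function evalRunCostCents(
  scenarios: number,
  centsPerScenario: number,
  cachedSavings: number, // fraction saved on runs that hit the cache, e.g. 0.9
  cacheWarm: boolean,
): number {
  const fullCost = scenarios * centsPerScenario;
  return cacheWarm ? Math.round(fullCost * (1 - cachedSavings)) : fullCost;
}

console.log(evalRunCostCents(30, 1, 0.9, false)); // 30 cents = $0.30 (first run)
console.log(evalRunCostCents(30, 1, 0.9, true)); // 3 cents = $0.03 (later runs)
```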
+---
+## CI/CD Integration
+- Run unit + integration tests on every commit (fast feedback)
+- Run E2E tests on every PR
+- Run LLM evals on a schedule (a weekly run catches regressions without per-commit cost)
+---
+## Quick Reference
+| Need to test...      | Test type   | Technology | Speed  | Cost       |
+| -------------------- | ----------- | ---------- | ------ | ---------- |
+| Pure function        | Unit        | Vitest     | Fast   | Free       |
+| Service integration  | Integration | Vitest     | Medium | Free       |
+| Full user flow       | E2E         | Playwright | Slow   | Free       |
+| AI reasoning quality | LLM eval    | Promptfoo  | Slow   | $0.01-0.30 |
+---
+## Project-Specific Testing Documentation
+**Location:** `tests/SAFEWORD.md` or `tests/AGENTS.md`
+**What to include:**
+- Tech stack (Vitest/Jest, Playwright/Cypress, Promptfoo)
+- Test commands (how to run tests, single-file execution)
+- Setup requirements (API keys, build steps, database)
+- File structure and naming conventions
+- Coverage requirements and PR requirements
+**If not found:** Ask user "Where are the testing docs?"

package/templates/scripts/lint-md.sh CHANGED

File without changes

package/templates/skills/safeword-systematic-debugger/SKILL.md CHANGED

@@ -270,4 +270,4 @@ See: @./.safeword/scripts/bisect-zombie-processes.sh
 - Process cleanup guide: @./.safeword/guides/zombie-process-cleanup.md
 - Debug logging style: @./.safeword/guides/code-philosophy.md
-- TDD for fix verification: @./.safeword/guides/tdd-best-practices.md
+- TDD for fix verification: @./.safeword/guides/testing-guide.md

package/templates/skills/safeword-tdd-enforcer/SKILL.md CHANGED

@@ -138,15 +138,37 @@ Before starting Phase 1, create or open a work log:
 - [ ] No hardcoded/mock values
 - [ ] Committed
+### Verification Gate
+**Before claiming GREEN:** Evidence before claims, always.
+```text
+✅ CORRECT                          ❌ WRONG
+─────────────────────────────────   ─────────────────────────────────
+Run: npm test                       "Tests should pass now"
+Output: ✓ 34/34 tests pass          "I'm confident this works"
+Claim: "All tests pass"             "Tests pass" (no output shown)
+```
+**The Rule:** If you haven't run the verification command in this response, you cannot claim it passes.
+| Claim            | Requires                      | Not Sufficient              |
+| ---------------- | ----------------------------- | --------------------------- |
+| "Tests pass"     | Fresh test output: 0 failures | "should pass", previous run |
+| "Build succeeds" | Build command: exit 0         | "linter passed"             |
+| "Bug fixed"      | Original symptom test passes  | "code changed"              |
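The "Tests pass" row of the table can be expressed as a predicate over recorded evidence. This is an illustrative sketch only; `CommandEvidence` and `canClaimTestsPass` are hypothetical names and safeword does not ship such a function.

```typescript
// Hypothetical record of a command that was actually executed.
interface CommandEvidence {
  command: string;
  exitCode: number;
  failures: number; // failing tests reported in the output
  ranInThisResponse: boolean;
}

// "Tests pass" is only claimable with fresh output showing zero failures.
function canClaimTestsPass(evidence?: CommandEvidence): boolean {
  if (!evidence) return false; // "should pass" with no run
  if (!evidence.ranInThisResponse) return false; // a previous run is not sufficient
  return evidence.exitCode === 0 && evidence.failures === 0;
}

const staleRun: CommandEvidence = { command: 'npm test', exitCode: 0, failures: 0, ranInThisResponse: false };
const freshRun: CommandEvidence = { command: 'npm test', exitCode: 0, failures: 0, ranInThisResponse: true };

console.log(canClaimTestsPass()); // false
console.log(canClaimTestsPass(staleRun)); // false
console.log(canClaimTestsPass(freshRun)); // true
```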
 **Red Flags → STOP:**
-| Flag                | Action                                 |
-| ------------------- | -------------------------------------- |
-| "Just in case" code | Delete it                              |
-| Multiple functions  | Delete extras                          |
-| Refactoring         | Stop - that's Phase 3                  |
-| Test still fails    | Debug (→ systematic-debugger if stuck) |
-| Hardcoded value     | Implement real logic (see below)       |
+| Flag                        | Action                                 |
+| --------------------------- | -------------------------------------- |
+| "should", "probably" claims | Run command, show output first         |
+| "Done!" before verification | Run command, show output first         |
+| "Just in case" code         | Delete it                              |
+| Multiple functions          | Delete extras                          |
+| Refactoring                 | Stop - that's Phase 3                  |
+| Test still fails            | Debug (→ systematic-debugger if stuck) |
+| Hardcoded value             | Implement real logic (see below)       |
 ### Anti-Pattern: Mock Implementations
@@ -241,6 +263,5 @@ Phase 0: L0 → create minimal spec → Phase 1: no new test (existing tests cov
 ## Related
-- @./.safeword/guides/test-definitions-guide.md
-- @./.safeword/guides/tdd-best-practices.md
-- @./.safeword/guides/development-workflow.md
+- @./.safeword/guides/planning-guide.md
+- @./.safeword/guides/testing-guide.md

package/dist/sync-AAG4SP5F.js DELETED

@@ -1,9 +0,0 @@
-import {
-  sync
-} from "./chunk-V5T3TEEQ.js";
-import "./chunk-LETSGOTR.js";
-import "./chunk-ORQHKDT2.js";
-export {
-  sync
-};
-//# sourceMappingURL=sync-AAG4SP5F.js.map