npm - safeword - Versions diffs - 0.2.4 → 0.2.5 - Mend

safeword 0.2.4 → 0.2.5

Files changed (235)

package/dist/check-3NGQ4NR5.js +129 -0
package/dist/check-3NGQ4NR5.js.map +1 -0
package/dist/chunk-2XWIUEQK.js +190 -0
package/dist/chunk-2XWIUEQK.js.map +1 -0
package/dist/chunk-GZRQL3SX.js +146 -0
package/dist/chunk-GZRQL3SX.js.map +1 -0
package/dist/chunk-ORQHKDT2.js +10 -0
package/dist/chunk-ORQHKDT2.js.map +1 -0
package/dist/chunk-W66Z3C5H.js +21 -0
package/dist/chunk-W66Z3C5H.js.map +1 -0
package/dist/cli.d.ts +1 -0
package/dist/cli.js +34 -0
package/dist/cli.js.map +1 -0
package/dist/diff-Y6QTAW4O.js +166 -0
package/dist/diff-Y6QTAW4O.js.map +1 -0
package/dist/index.d.ts +11 -0
package/dist/index.js +7 -0
package/dist/index.js.map +1 -0
package/dist/reset-3ACTIYYE.js +143 -0
package/dist/reset-3ACTIYYE.js.map +1 -0
package/dist/setup-RR4M334C.js +266 -0
package/dist/setup-RR4M334C.js.map +1 -0
package/dist/upgrade-6AR3DHUV.js +134 -0
package/dist/upgrade-6AR3DHUV.js.map +1 -0
package/package.json +44 -19
package/{.safeword → templates}/hooks/agents-md-check.sh +0 -0
package/{.safeword → templates}/hooks/post-tool.sh +0 -0
package/{.safeword → templates}/hooks/pre-commit.sh +0 -0
package/.claude/commands/arch-review.md +0 -32
package/.claude/commands/lint.md +0 -6
package/.claude/commands/quality-review.md +0 -13
package/.claude/commands/setup-linting.md +0 -6
package/.claude/hooks/auto-lint.sh +0 -6
package/.claude/hooks/auto-quality-review.sh +0 -170
package/.claude/hooks/check-linting-sync.sh +0 -17
package/.claude/hooks/inject-timestamp.sh +0 -6
package/.claude/hooks/question-protocol.sh +0 -12
package/.claude/hooks/run-linters.sh +0 -8
package/.claude/hooks/run-quality-review.sh +0 -76
package/.claude/hooks/version-check.sh +0 -10
package/.claude/mcp/README.md +0 -96
package/.claude/mcp/arcade.sample.json +0 -9
package/.claude/mcp/context7.sample.json +0 -7
package/.claude/mcp/playwright.sample.json +0 -7
package/.claude/settings.json +0 -62
package/.claude/skills/quality-reviewer/SKILL.md +0 -190
package/.claude/skills/safeword-quality-reviewer/SKILL.md +0 -13
package/.env.arcade.example +0 -4
package/.env.example +0 -11
package/.gitmodules +0 -4
package/.safeword/SAFEWORD.md +0 -33
package/.safeword/eslint/eslint-base.mjs +0 -101
package/.safeword/guides/architecture-guide.md +0 -404
package/.safeword/guides/code-philosophy.md +0 -174
package/.safeword/guides/context-files-guide.md +0 -405
package/.safeword/guides/data-architecture-guide.md +0 -183
package/.safeword/guides/design-doc-guide.md +0 -165
package/.safeword/guides/learning-extraction.md +0 -515
package/.safeword/guides/llm-instruction-design.md +0 -239
package/.safeword/guides/llm-prompting.md +0 -95
package/.safeword/guides/tdd-best-practices.md +0 -570
package/.safeword/guides/test-definitions-guide.md +0 -243
package/.safeword/guides/testing-methodology.md +0 -573
package/.safeword/guides/user-story-guide.md +0 -237
package/.safeword/guides/zombie-process-cleanup.md +0 -214
package/.safeword/planning/002-user-story-quality-evaluation.md +0 -1840
package/.safeword/planning/003-langsmith-eval-setup-prompt.md +0 -363
package/.safeword/planning/004-llm-eval-test-cases.md +0 -3226
package/.safeword/planning/005-architecture-enforcement-system.md +0 -169
package/.safeword/planning/006-reactive-fix-prevention-research.md +0 -135
package/.safeword/planning/011-cli-ux-vision.md +0 -330
package/.safeword/planning/012-project-structure-cleanup.md +0 -154
package/.safeword/planning/README.md +0 -39
package/.safeword/planning/automation-plan-v2.md +0 -1225
package/.safeword/planning/automation-plan-v3.md +0 -1291
package/.safeword/planning/automation-plan.md +0 -3058
package/.safeword/planning/design/005-cli-implementation.md +0 -343
package/.safeword/planning/design/013-cli-self-contained-templates.md +0 -596
package/.safeword/planning/design/013a-eslint-plugin-suite.md +0 -256
package/.safeword/planning/design/013b-implementation-snippets.md +0 -385
package/.safeword/planning/design/013c-config-isolation-strategy.md +0 -242
package/.safeword/planning/design/code-philosophy-improvements.md +0 -60
package/.safeword/planning/mcp-analysis.md +0 -545
package/.safeword/planning/phase2-subagents-vs-skills-analysis.md +0 -451
package/.safeword/planning/settings-improvements.md +0 -970
package/.safeword/planning/test-definitions/005-cli-implementation.md +0 -1301
package/.safeword/planning/test-definitions/cli-self-contained-templates.md +0 -205
package/.safeword/planning/user-stories/001-guides-review-user-stories.md +0 -1381
package/.safeword/planning/user-stories/003-reactive-fix-prevention.md +0 -132
package/.safeword/planning/user-stories/004-technical-constraints.md +0 -86
package/.safeword/planning/user-stories/005-cli-implementation.md +0 -311
package/.safeword/planning/user-stories/cli-self-contained-templates.md +0 -172
package/.safeword/planning/versioned-distribution.md +0 -740
package/.safeword/prompts/arch-review.md +0 -43
package/.safeword/prompts/quality-review.md +0 -11
package/.safeword/scripts/arch-review.sh +0 -235
package/.safeword/scripts/check-linting-sync.sh +0 -58
package/.safeword/scripts/setup-linting.sh +0 -559
package/.safeword/templates/architecture-template.md +0 -136
package/.safeword/templates/ci/architecture-check.yml +0 -79
package/.safeword/templates/design-doc-template.md +0 -127
package/.safeword/templates/test-definitions-feature.md +0 -100
package/.safeword/templates/ticket-template.md +0 -74
package/.safeword/templates/user-stories-template.md +0 -82
package/.safeword/tickets/001-guides-review-user-stories.md +0 -83
package/.safeword/tickets/002-architecture-enforcement.md +0 -211
package/.safeword/tickets/003-reactive-fix-prevention.md +0 -57
package/.safeword/tickets/004-technical-constraints-in-user-stories.md +0 -39
package/.safeword/tickets/005-cli-implementation.md +0 -248
package/.safeword/tickets/006-flesh-out-skills.md +0 -43
package/.safeword/tickets/007-flesh-out-questioning.md +0 -44
package/.safeword/tickets/008-upgrade-questioning.md +0 -58
package/.safeword/tickets/009-naming-conventions.md +0 -41
package/.safeword/tickets/010-safeword-md-cleanup.md +0 -34
package/.safeword/tickets/011-cursor-setup.md +0 -86
package/.safeword/tickets/README.md +0 -73
package/.safeword/version +0 -1
package/AGENTS.md +0 -59
package/CLAUDE.md +0 -12
package/README.md +0 -347
package/docs/001-cli-implementation-plan.md +0 -856
package/docs/elite-dx-implementation-plan.md +0 -1034
package/framework/README.md +0 -131
package/framework/mcp/README.md +0 -96
package/framework/mcp/arcade.sample.json +0 -8
package/framework/mcp/context7.sample.json +0 -6
package/framework/mcp/playwright.sample.json +0 -6
package/framework/scripts/arch-review.sh +0 -235
package/framework/scripts/check-linting-sync.sh +0 -58
package/framework/scripts/load-env.sh +0 -49
package/framework/scripts/setup-claude.sh +0 -223
package/framework/scripts/setup-linting.sh +0 -559
package/framework/scripts/setup-quality.sh +0 -477
package/framework/scripts/setup-safeword.sh +0 -550
package/framework/templates/ci/architecture-check.yml +0 -78
package/learnings/ai-sdk-v5-breaking-changes.md +0 -178
package/learnings/e2e-test-zombie-processes.md +0 -231
package/learnings/milkdown-crepe-editor-property.md +0 -96
package/learnings/prosemirror-fragment-traversal.md +0 -119
package/packages/cli/AGENTS.md +0 -1
package/packages/cli/ARCHITECTURE.md +0 -279
package/packages/cli/package.json +0 -51
package/packages/cli/src/cli.ts +0 -63
package/packages/cli/src/commands/check.ts +0 -166
package/packages/cli/src/commands/diff.ts +0 -209
package/packages/cli/src/commands/reset.ts +0 -190
package/packages/cli/src/commands/setup.ts +0 -325
package/packages/cli/src/commands/upgrade.ts +0 -163
package/packages/cli/src/index.ts +0 -3
package/packages/cli/src/templates/config.ts +0 -58
package/packages/cli/src/templates/content.ts +0 -18
package/packages/cli/src/templates/index.ts +0 -12
package/packages/cli/src/utils/agents-md.ts +0 -66
package/packages/cli/src/utils/fs.ts +0 -179
package/packages/cli/src/utils/git.ts +0 -124
package/packages/cli/src/utils/hooks.ts +0 -29
package/packages/cli/src/utils/output.ts +0 -60
package/packages/cli/src/utils/project-detector.test.ts +0 -185
package/packages/cli/src/utils/project-detector.ts +0 -44
package/packages/cli/src/utils/version.ts +0 -28
package/packages/cli/src/version.ts +0 -6
package/packages/cli/templates/SAFEWORD.md +0 -776
package/packages/cli/templates/doc-templates/architecture-template.md +0 -136
package/packages/cli/templates/doc-templates/design-doc-template.md +0 -134
package/packages/cli/templates/doc-templates/test-definitions-feature.md +0 -131
package/packages/cli/templates/doc-templates/ticket-template.md +0 -82
package/packages/cli/templates/doc-templates/user-stories-template.md +0 -92
package/packages/cli/templates/guides/architecture-guide.md +0 -423
package/packages/cli/templates/guides/code-philosophy.md +0 -195
package/packages/cli/templates/guides/context-files-guide.md +0 -457
package/packages/cli/templates/guides/data-architecture-guide.md +0 -200
package/packages/cli/templates/guides/design-doc-guide.md +0 -171
package/packages/cli/templates/guides/learning-extraction.md +0 -552
package/packages/cli/templates/guides/llm-instruction-design.md +0 -248
package/packages/cli/templates/guides/llm-prompting.md +0 -102
package/packages/cli/templates/guides/tdd-best-practices.md +0 -615
package/packages/cli/templates/guides/test-definitions-guide.md +0 -334
package/packages/cli/templates/guides/testing-methodology.md +0 -618
package/packages/cli/templates/guides/user-story-guide.md +0 -256
package/packages/cli/templates/guides/zombie-process-cleanup.md +0 -219
package/packages/cli/templates/hooks/agents-md-check.sh +0 -27
package/packages/cli/templates/hooks/post-tool.sh +0 -4
package/packages/cli/templates/hooks/pre-commit.sh +0 -10
package/packages/cli/templates/prompts/arch-review.md +0 -43
package/packages/cli/templates/prompts/quality-review.md +0 -10
package/packages/cli/templates/skills/safeword-quality-reviewer/SKILL.md +0 -207
package/packages/cli/tests/commands/check.test.ts +0 -129
package/packages/cli/tests/commands/cli.test.ts +0 -89
package/packages/cli/tests/commands/diff.test.ts +0 -115
package/packages/cli/tests/commands/reset.test.ts +0 -310
package/packages/cli/tests/commands/self-healing.test.ts +0 -170
package/packages/cli/tests/commands/setup-blocking.test.ts +0 -71
package/packages/cli/tests/commands/setup-core.test.ts +0 -135
package/packages/cli/tests/commands/setup-git.test.ts +0 -139
package/packages/cli/tests/commands/setup-hooks.test.ts +0 -334
package/packages/cli/tests/commands/setup-linting.test.ts +0 -189
package/packages/cli/tests/commands/setup-noninteractive.test.ts +0 -80
package/packages/cli/tests/commands/setup-templates.test.ts +0 -181
package/packages/cli/tests/commands/upgrade.test.ts +0 -215
package/packages/cli/tests/helpers.ts +0 -243
package/packages/cli/tests/npm-package.test.ts +0 -83
package/packages/cli/tests/technical-constraints.test.ts +0 -96
package/packages/cli/tsconfig.json +0 -25
package/packages/cli/tsup.config.ts +0 -11
package/packages/cli/vitest.config.ts +0 -23
package/promptfoo.yaml +0 -3270
/package/{framework → templates}/SAFEWORD.md +0 -0
/package/{packages/cli/templates → templates}/commands/arch-review.md +0 -0
/package/{packages/cli/templates → templates}/commands/lint.md +0 -0
/package/{packages/cli/templates → templates}/commands/quality-review.md +0 -0
/package/{framework/templates → templates/doc-templates}/architecture-template.md +0 -0
/package/{framework/templates → templates/doc-templates}/design-doc-template.md +0 -0
/package/{framework/templates → templates/doc-templates}/test-definitions-feature.md +0 -0
/package/{framework/templates → templates/doc-templates}/ticket-template.md +0 -0
/package/{framework/templates → templates/doc-templates}/user-stories-template.md +0 -0
/package/{framework → templates}/guides/architecture-guide.md +0 -0
/package/{framework → templates}/guides/code-philosophy.md +0 -0
/package/{framework → templates}/guides/context-files-guide.md +0 -0
/package/{framework → templates}/guides/data-architecture-guide.md +0 -0
/package/{framework → templates}/guides/design-doc-guide.md +0 -0
/package/{framework → templates}/guides/learning-extraction.md +0 -0
/package/{framework → templates}/guides/llm-instruction-design.md +0 -0
/package/{framework → templates}/guides/llm-prompting.md +0 -0
/package/{framework → templates}/guides/tdd-best-practices.md +0 -0
/package/{framework → templates}/guides/test-definitions-guide.md +0 -0
/package/{framework → templates}/guides/testing-methodology.md +0 -0
/package/{framework → templates}/guides/user-story-guide.md +0 -0
/package/{framework → templates}/guides/zombie-process-cleanup.md +0 -0
/package/{packages/cli/templates → templates}/hooks/inject-timestamp.sh +0 -0
/package/{packages/cli/templates → templates}/lib/common.sh +0 -0
/package/{packages/cli/templates → templates}/lib/jq-fallback.sh +0 -0
/package/{packages/cli/templates → templates}/markdownlint.jsonc +0 -0
/package/{framework → templates}/prompts/arch-review.md +0 -0
/package/{framework → templates}/prompts/quality-review.md +0 -0
/package/{framework/skills/quality-reviewer → templates/skills/safeword-quality-reviewer}/SKILL.md +0 -0

package/.safeword/planning/003-langsmith-eval-setup-prompt.md DELETED

@@ -1,363 +0,0 @@
-# LangSmith Eval Setup Plan
-**Goal:** Test whether an LLM agent correctly follows SAFEWORD documentation/guides when given various scenarios.
-**Scale:** 140 user stories across 13 guides → expect 200+ test cases over time.
----
-## Phase 1: Foundation (Do First)
-### 1.1 Project Structure
-```
-evals/
-├── package.json                 # Dependencies: langsmith, langchain, openai/anthropic SDK
-├── tsconfig.json
-├── .env.example                 # LANGSMITH_API_KEY, OPENAI_API_KEY, etc.
-├── src/
-│   ├── config.ts                # LangSmith project config, model selection
-│   ├── context-loader.ts        # Load SAFEWORD.md + guides as context
-│   ├── evaluators/
-│   │   ├── section-presence.ts  # Check if output has required sections
-│   │   ├── doc-type-decision.ts # Check if correct doc type was chosen
-│   │   └── prerequisites.ts     # Check if prerequisites were verified
-│   └── runner.ts                # Execute evals, upload to LangSmith
-├── datasets/
-│   ├── architecture-guide/      # Tests for architecture-guide.md
-│   │   ├── create-doc.json      # Test 1: Create Architecture Doc
-│   │   ├── tech-choice.json     # Test 2: Doc Type Decision - Tech Choice
-│   │   └── ...
-│   ├── design-doc-guide/        # Tests for design-doc-guide.md
-│   └── ... (one folder per guide)
-└── scripts/
-    ├── run-all.ts               # Run full eval suite
-    ├── run-guide.ts             # Run evals for one guide
-    └── upload-datasets.ts       # Sync datasets to LangSmith
-```
-### 1.2 Core Dependencies
-```json
-{
-  "dependencies": {
-    "langsmith": "^0.1.0",
-    "openai": "^4.0.0",
-    "@anthropic-ai/sdk": "^0.30.0"
-  },
-  "devDependencies": {
-    "tsx": "^4.0.0",
-    "dotenv": "^16.0.0"
-  }
-}
-```
-### 1.3 Dataset Schema (Standard for All Tests)
-```typescript
-interface TestCase {
-  id: string; // e.g., "arch-001-create-doc"
-  guide: string; // e.g., "architecture-guide"
-  story_id: number; // Maps to user story number
-  input: string; // User prompt
-  context_files: string[]; // Which files to load as context
-  expected_behavior: string; // What the agent should do
-  rubric: {
-    excellent: string;
-    acceptable: string;
-    poor: string;
-  };
-}
-```
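To make the schema above concrete, here is a sketch of one dataset entry, filled in from Test 1 as described in the prompt at the end of this file. The `context_files` paths and the `isTestCase` guard are assumptions made for illustration; the plan itself does not define either.

```typescript
// Schema as given in the plan above.
interface TestCase {
  id: string;
  guide: string;
  story_id: number;
  input: string;
  context_files: string[];
  expected_behavior: string;
  rubric: { excellent: string; acceptable: string; poor: string };
}

// Hypothetical guard (not in the plan) for JSON read from datasets/.
function isTestCase(value: unknown): value is TestCase {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const files = v.context_files;
  const rubric = v.rubric as Record<string, unknown> | null | undefined;
  return (
    typeof v.id === 'string' &&
    typeof v.guide === 'string' &&
    typeof v.story_id === 'number' &&
    typeof v.input === 'string' &&
    typeof v.expected_behavior === 'string' &&
    Array.isArray(files) &&
    files.every(f => typeof f === 'string') &&
    typeof rubric === 'object' &&
    rubric !== null &&
    typeof rubric.excellent === 'string' &&
    typeof rubric.acceptable === 'string' &&
    typeof rubric.poor === 'string'
  );
}

const createDoc: TestCase = {
  id: 'arch-001-create-doc',
  guide: 'architecture-guide',
  story_id: 1,
  input: 'Create an architecture doc for a new React + Supabase project',
  // Assumed paths, based on the framework/ layout mentioned in the prompt.
  context_files: ['framework/SAFEWORD.md', 'framework/guides/architecture-guide.md'],
  expected_behavior: 'Output contains all 10 required sections',
  rubric: {
    excellent: 'All 10 sections with What/Why/Trade-off',
    acceptable: '8+ sections',
    poor: '<8 sections',
  },
};
```

A guard like this would let the runner reject malformed dataset files before any model call is made.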
----
-## Phase 2: Evaluator Patterns (Reusable)
-### 2.1 Section Presence Evaluator
-Used by: Tests 1, 6 (any "Create X Doc" test)
-```typescript
-// evaluators/section-presence.ts
-const evaluate = (output: string, requiredSections: string[]) => {
-  const found = requiredSections.filter(s => output.includes(s));
-  const score = found.length / requiredSections.length;
-  if (score === 1) return { score: 1, label: 'excellent' };
-  if (score >= 0.8) return { score: 0.8, label: 'acceptable' };
-  return { score: 0.4, label: 'poor' };
-};
-```
-### 2.2 Decision Evaluator
-Used by: Tests 2, 3, 5, 8, 10 (any "Which doc type?" test)
-```typescript
-// evaluators/doc-type-decision.ts
-const evaluate = (output: string, expectedType: 'architecture' | 'design' | 'none') => {
-  const mentionsArch = /architecture\s*(doc|document)/i.test(output);
-  const mentionsDesign = /design\s*(doc|document)/i.test(output);
-  // Scoring logic based on expectedType
-};
-```
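The plan leaves the scoring logic of the decision evaluator open. One possible way to fill it in is sketched below. The score values, the `EvalResult` shape, the function name, and the negation cues used for the `'none'` case are all assumptions, and a regex cannot tell whether the agent also explained its choice, so "excellent" here only means "named the expected doc type and no other".

```typescript
type DocType = 'architecture' | 'design' | 'none';

interface EvalResult {
  score: number;
  label: 'excellent' | 'acceptable' | 'poor';
}

const evaluateDocTypeDecision = (output: string, expectedType: DocType): EvalResult => {
  const mentionsArch = /architecture\s*(doc|document)/i.test(output);
  const mentionsDesign = /design\s*(doc|document)/i.test(output);

  if (expectedType === 'none') {
    // Assumption: a "no doc needed" answer names a doc type only to decline it,
    // so look for an explicit negative cue rather than the doc-type mentions.
    const declines = /\b(no need|not needed|don't need|do not need|skip)\b/i.test(output);
    return declines ? { score: 1, label: 'excellent' } : { score: 0, label: 'poor' };
  }

  const hit = expectedType === 'architecture' ? mentionsArch : mentionsDesign;
  const miss = expectedType === 'architecture' ? mentionsDesign : mentionsArch;

  if (hit && !miss) return { score: 1, label: 'excellent' };
  // Both types mentioned: the answer may be hedging, so give partial credit.
  if (hit && miss) return { score: 0.5, label: 'acceptable' };
  return { score: 0, label: 'poor' };
};
```

An answer such as "an architecture doc, not a design doc" lands in the partial-credit branch even though it is correct, which is the kind of case the LLM-as-judge evaluator is better suited to.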
-### 2.3 Prerequisites Evaluator
-Used by: Test 7 (any "Did agent check prerequisites?" test)
-```typescript
-// evaluators/prerequisites.ts
-const evaluate = (output: string) => {
-  const asksAboutStories = /user stor(y|ies)/i.test(output);
-  const asksAboutTestDefs = /test definition/i.test(output);
-  const offersToCreate = /create|offer|would you like/i.test(output);
-  // Excellent = checks before creating
-};
-```
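The prerequisites evaluator likewise stops at a comment. A minimal sketch of how the three flags might be combined follows. The thresholds and labels are assumptions, and the regexes cannot verify that the check happened before the doc was created, only that the prerequisites were mentioned at all.

```typescript
interface EvalResult {
  score: number;
  label: 'excellent' | 'acceptable' | 'poor';
}

const evaluatePrerequisites = (output: string): EvalResult => {
  const asksAboutStories = /user stor(y|ies)/i.test(output);
  const asksAboutTestDefs = /test definition/i.test(output);
  const offersToCreate = /create|offer|would you like/i.test(output);

  const mentionsPrereq = asksAboutStories || asksAboutTestDefs;

  // Assumed mapping: mentioning a prerequisite and offering to create it is the best case.
  if (mentionsPrereq && offersToCreate) return { score: 1, label: 'excellent' };
  if (mentionsPrereq) return { score: 0.5, label: 'acceptable' };
  return { score: 0, label: 'poor' };
};
```

Because `create` also matches a sentence such as "I will create the design doc", this check is loose and probably needs the LLM judge as a second opinion.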
-### 2.4 LLM-as-Judge (General Purpose)
-Used by: Complex cases where regex isn't enough
-```typescript
-// evaluators/llm-judge.ts
-const judgePrompt = `
-You are evaluating an AI coding assistant's response.
-Rubric:
-- EXCELLENT: {rubric.excellent}
-- ACCEPTABLE: {rubric.acceptable}
-- POOR: {rubric.poor}
-User Input: {input}
-Assistant Output: {output}
-Rate the response as EXCELLENT, ACCEPTABLE, or POOR. Explain briefly.
-`;
-```
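The judge prompt above uses `{placeholder}` markers but the plan does not show how they are filled or how the reply is read back. The sketch below shows one way to do both without calling any model. `fillJudgePrompt` and `parseVerdict` are hypothetical helper names, not part of the plan.

```typescript
interface Rubric {
  excellent: string;
  acceptable: string;
  poor: string;
}

type Verdict = 'EXCELLENT' | 'ACCEPTABLE' | 'POOR';

// Template text taken from the plan; the braces are literal, not ${} interpolation.
const judgePrompt = `
You are evaluating an AI coding assistant's response.
Rubric:
- EXCELLENT: {rubric.excellent}
- ACCEPTABLE: {rubric.acceptable}
- POOR: {rubric.poor}
User Input: {input}
Assistant Output: {output}
Rate the response as EXCELLENT, ACCEPTABLE, or POOR. Explain briefly.
`;

// Replace every {key} in a single pass so substituted text is never re-expanded.
function fillJudgePrompt(rubric: Rubric, input: string, output: string): string {
  const values: Record<string, string> = {
    'rubric.excellent': rubric.excellent,
    'rubric.acceptable': rubric.acceptable,
    'rubric.poor': rubric.poor,
    input,
    output,
  };
  return judgePrompt.replace(/\{([\w.]+)\}/g, (match, key: string) => values[key] ?? match);
}

// Take the first rating word in the judge's reply; null means the reply was unusable.
function parseVerdict(reply: string): Verdict | null {
  const match = reply.match(/\b(EXCELLENT|ACCEPTABLE|POOR)\b/);
  return match ? (match[1] as Verdict) : null;
}
```

Returning `null` for an unparseable reply lets the runner retry or flag the case instead of silently scoring it.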
----
-## Phase 3: Implementation Order
-### Week 1: Setup + First 10 Tests
-| Task                                | Output                    |
-| ----------------------------------- | ------------------------- |
-| Create `evals/` directory structure | Scaffolding               |
-| Install dependencies                | `package.json`            |
-| Create context loader               | Load SAFEWORD.md + guides |
-| Create LangSmith project            | `safeword-evals` project  |
-| Implement Tests 1-10 from prompt    | First dataset uploaded    |
-### Week 2: Evaluator Library
-| Task                       | Output                              |
-| -------------------------- | ----------------------------------- |
-| Section presence evaluator | Reusable for all "create doc" tests |
-| Decision evaluator         | Reusable for all "which doc?" tests |
-| Prerequisites evaluator    | Reusable for workflow tests         |
-| LLM-as-judge wrapper       | General fallback                    |
-### Week 3+: Scale Up
-| Guide                     | # Stories | Priority       |
-| ------------------------- | --------- | -------------- |
-| architecture-guide.md     | 11        | HIGH (4 done)  |
-| design-doc-guide.md       | 10        | HIGH (partial) |
-| testing-methodology.md    | 13        | MEDIUM         |
-| tdd-templates.md          | 16        | MEDIUM         |
-| llm-instruction-design.md | 15        | LOW            |
-| ...                       | ...       | ...            |
----
-## Phase 4: CI Integration (Defer)
-```yaml
-# .github/workflows/evals.yml
-name: Run Evals
-on:
-  push:
-    paths:
-      - 'framework/**' # Run when guides change
-  schedule:
-    - cron: '0 0 * * 0' # Weekly regression check
-jobs:
-  eval:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - run: cd evals && npm ci
-      - run: cd evals && npm run eval:all
-        env:
-          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
-```
----
-## Cost Controls
-| Control             | Implementation                                                   |
-| ------------------- | ---------------------------------------------------------------- |
-| **Model selection** | Use GPT-4o-mini for judge (cheap), GPT-4o for target (expensive) |
-| **Sampling**        | Run subset in CI, full suite on demand                           |
-| **Caching**         | Cache context loading; SAFEWORD.md rarely changes                |
-| **Budget alerts**   | LangSmith has spend tracking                                     |
-**Estimated costs (200 test cases):**
-- Full suite run: ~$5-10 (depends on output length)
-- CI weekly run: ~$50/month
----
-## Naming Conventions
-```
-Test ID format: {guide-prefix}-{story-num}-{test-slug}
-Examples:
-- arch-001-create-doc         (architecture-guide, story 1)
-- arch-002-tech-choice        (architecture-guide, story 3)
-- design-001-create-doc       (design-doc-guide, story 2)
-- design-002-prereqs          (design-doc-guide, story 1)
-```
-| Guide                      | Prefix     |
-| -------------------------- | ---------- |
-| architecture-guide.md      | arch       |
-| design-doc-guide.md        | design     |
-| testing-methodology.md     | test       |
-| tdd-templates.md           | tdd        |
-| code-philosophy.md         | code       |
-| context-files-guide.md     | ctx        |
-| data-architecture-guide.md | data       |
-| learning-extraction.md     | learn      |
-| llm-instruction-design.md  | llm-instr  |
-| llm-prompting.md           | llm-prompt |
-| test-definitions-guide.md  | testdef    |
-| user-story-guide.md        | story      |
-| zombie-process-cleanup.md  | zombie     |
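The ID format and prefix table above can be encoded as a small helper. This is a sketch only: `buildTestId` is a hypothetical name, and the three-digit zero padding is inferred from the examples (`arch-001-...`) rather than stated in the plan.

```typescript
// Prefixes copied from the table above.
const GUIDE_PREFIXES: Record<string, string> = {
  'architecture-guide.md': 'arch',
  'design-doc-guide.md': 'design',
  'testing-methodology.md': 'test',
  'tdd-templates.md': 'tdd',
  'code-philosophy.md': 'code',
  'context-files-guide.md': 'ctx',
  'data-architecture-guide.md': 'data',
  'learning-extraction.md': 'learn',
  'llm-instruction-design.md': 'llm-instr',
  'llm-prompting.md': 'llm-prompt',
  'test-definitions-guide.md': 'testdef',
  'user-story-guide.md': 'story',
  'zombie-process-cleanup.md': 'zombie',
};

// Build an ID of the form {guide-prefix}-{story-num}-{test-slug}.
function buildTestId(guideFile: string, storyNum: number, slug: string): string {
  const prefix = GUIDE_PREFIXES[guideFile];
  if (prefix === undefined) {
    throw new Error(`Unknown guide: ${guideFile}`);
  }
  // Assumed: numbers are zero-padded to three digits, as in the examples.
  return `${prefix}-${String(storyNum).padStart(3, '0')}-${slug}`;
}
```

For example, `buildTestId('architecture-guide.md', 1, 'create-doc')` yields `arch-001-create-doc`, matching the first example in the list above.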
----
-## Success Criteria
-| Metric             | Target                                |
-| ------------------ | ------------------------------------- |
-| Test coverage      | ≥1 test per evaluated user story      |
-| Pass rate baseline | Establish baseline, track regressions |
-| Cost per run       | <$15 for full suite                   |
-| Run time           | <10 min for full suite                |
----
-## Files to Create (Initial Implementation)
-1. `evals/package.json` — Dependencies
-2. `evals/src/config.ts` — LangSmith + model config
-3. `evals/src/context-loader.ts` — Load guide files
-4. `evals/src/evaluators/section-presence.ts` — First evaluator
-5. `evals/src/runner.ts` — Execute tests
-6. `evals/datasets/architecture-guide/create-doc.json` — First test case
-7. `evals/scripts/run-all.ts` — Entry point
----
-## Related Files
-- Evaluation plan: `.safeword/planning/002-user-story-quality-evaluation.md`
-- User stories source: `.safeword/planning/user-stories/001-guides-review-user-stories.md`
-- Original prompt: see the "Prompt to Use in New Thread" section below
----
-## Prompt to Use in New Thread
-```
-I want to set up LLM evaluations using LangSmith for my AI coding agent framework (SAFEWORD).
-**Goal:** Test whether an LLM agent correctly follows the documentation/guides when given various scenarios.
-**Tech context:**
-- Framework location: /Users/alex/projects/safeword/framework/
-- Guides location: framework/guides/
-- Main instruction file: framework/SAFEWORD.md
-**What I need:**
-1. LangSmith project setup for evals
-2. Dataset creation with test scenarios
-3. Evaluator rubrics (LLM-as-judge)
-4. Integration with CI (optional, can defer)
-**Test scenarios to implement:**
-### Architecture Doc Tests (from architecture-guide.md)
-**Test 1: Create Architecture Doc**
-- Input: "Create an architecture doc for a new React + Supabase project"
-- Expected: Output contains all 10 required sections (Header, TOC, Overview, Data Principles, Data Model, Components, Data Flows, Key Decisions, Best Practices, Migration)
-- Rubric: EXCELLENT = all 10 sections with What/Why/Trade-off; ACCEPTABLE = 8+ sections; POOR = <8 sections
-**Test 2: Doc Type Decision - Tech Choice**
-- Input: "I need to document our decision to use PostgreSQL instead of MongoDB"
-- Expected: Agent chooses Architecture Doc (not Design Doc)
-- Rubric: EXCELLENT = correctly identifies Architecture Doc + explains why; POOR = suggests Design Doc
-**Test 3: Doc Type Decision - Feature**
-- Input: "I need to document how the user profile feature will work"
-- Expected: Agent chooses Design Doc (not Architecture Doc)
-- Rubric: EXCELLENT = correctly identifies Design Doc + checks for prerequisites; POOR = suggests Architecture Doc
-**Test 4: Decision Documentation**
-- Input: "Document our decision to use Redis for caching"
-- Expected: Output includes What, Why, Trade-off, Alternatives Considered
-- Rubric: EXCELLENT = all 4 fields with specifics; ACCEPTABLE = What/Why/Trade-off; POOR = missing Why or Trade-off
-**Test 5: Ambiguous Scenario - Tie-breaker**
-- Input: "I need to document adding a caching layer that will be used by multiple features"
-- Expected: Agent chooses Architecture Doc (affects 2+ features)
-- Rubric: EXCELLENT = Architecture Doc + cites tie-breaking rule; ACCEPTABLE = Architecture Doc; POOR = Design Doc
-### Design Doc Tests (from design-doc-guide.md)
-**Test 6: Create Design Doc**
-- Input: "Create a design doc for a three-pane layout feature"
-- Expected: Output has required sections (Architecture, Components with [N]/[N+1], User Flow, Key Decisions)
-- Rubric: EXCELLENT = all required sections + references user stories/test defs; ACCEPTABLE = missing 1-2 optional sections; POOR = missing User Flow or Components
-**Test 7: Prerequisites Check**
-- Input: "Create a design doc for the payment flow feature" (assume no user stories exist)
-- Expected: Agent asks about or offers to create user stories first
-- Rubric: EXCELLENT = checks prerequisites before creating; POOR = creates design doc without checking
-**Test 8: Complexity Assessment**
-- Input: "Do I need a design doc for adding a logout button?"
-- Expected: Agent says no (simple, <3 components, single user story)
-- Rubric: EXCELLENT = correctly assesses as too simple + explains why; POOR = recommends design doc
-### Edge Case Tests
-**Test 9: Scattered ADRs Migration**
-- Input: "Our project has 50 ADR files in docs/adr/. What should we do?"
-- Expected: Agent recommends consolidating into single ARCHITECTURE.md
-- Rubric: EXCELLENT = recommends consolidation + provides migration steps; POOR = suggests keeping ADRs
-**Test 10: Borderline Complexity**
-- Input: "I'm building a feature that touches exactly 3 components and has 2 user stories"
-- Expected: Agent recommends design doc (meets threshold)
-- Rubric: EXCELLENT = recommends design doc + cites complexity criteria; ACCEPTABLE = recommends design doc; POOR = says skip design doc
-**Evaluation approach:**
-- Use LLM-as-judge with Claude or GPT-4 as evaluator
-- Each test should have the SAFEWORD.md and relevant guide loaded as context
-- Track pass/fail rates over time to catch regressions
-Please help me set this up in LangSmith. Start by explaining the LangSmith concepts I need to know, then walk me through creating the first few test cases.
-```