@simplysm/sd-claude 13.0.75 → 13.0.76
- package/claude/refs/sd-code-conventions.md +92 -2
- package/claude/refs/sd-solid.md +2 -0
- package/claude/refs/sd-workflow.md +2 -1
- package/claude/rules/sd-claude-rules.md +21 -0
- package/claude/rules/sd-refs-linker.md +1 -1
- package/claude/sd-statusline.js +53 -11
- package/claude/skills/sd-api-name-review/SKILL.md +22 -3
- package/claude/skills/sd-brainstorm/SKILL.md +90 -1
- package/claude/skills/sd-check/SKILL.md +30 -14
- package/claude/skills/sd-commit/SKILL.md +0 -1
- package/claude/skills/sd-debug/SKILL.md +14 -13
- package/claude/skills/sd-explore/SKILL.md +76 -0
- package/claude/skills/sd-plan/SKILL.md +32 -0
- package/claude/skills/sd-plan-dev/SKILL.md +53 -2
- package/claude/skills/sd-plan-dev/code-quality-reviewer-prompt.md +1 -3
- package/claude/skills/sd-plan-dev/implementer-prompt.md +10 -1
- package/claude/skills/sd-readme/SKILL.md +1 -1
- package/claude/skills/sd-review/SKILL.md +73 -27
- package/claude/skills/sd-review/api-reviewer-prompt.md +6 -1
- package/claude/skills/sd-review/code-reviewer-prompt.md +9 -3
- package/claude/skills/sd-review/code-simplifier-prompt.md +43 -36
- package/claude/skills/sd-review/convention-checker-prompt.md +64 -0
- package/claude/skills/sd-review/structure-analyzer-prompt.md +97 -0
- package/claude/skills/sd-skill/SKILL.md +23 -0
- package/claude/skills/sd-skill/anthropic-best-practices.md +71 -1091
- package/claude/skills/sd-skill/testing-skills-with-subagents.md +9 -5
- package/claude/skills/sd-use/SKILL.md +19 -27
- package/package.json +1 -1
- package/claude/skills/sd-check/baseline-analysis.md +0 -150
- package/claude/skills/sd-check/test-scenarios.md +0 -205
- package/claude/skills/sd-debug/test-baseline-pressure.md +0 -61
package/claude/skills/sd-skill/testing-skills-with-subagents.md
CHANGED
@@ -16,18 +16,20 @@ You run scenarios without the skill (RED - watch agent fail), write skill addres
 
 ## When to Use
 
-
+**Pressure test** skills that:
 
 - Enforce discipline (TDD, testing requirements)
 - Have compliance costs (time, effort, rework)
 - Could be rationalized away ("just this once")
 - Contradict immediate goals (speed over quality)
 
-
+**Retrieval test** (not pressure test) skills that:
 
--
--
--
+- Are pure reference (API docs, syntax guides)
+- Have no rules to violate
+- Have no incentive to bypass
+
+Retrieval tests verify agents can find and correctly apply the information. See SKILL.md "Testing All Skill Types > Reference Skills" for methodology.
 
 ## TDD Mapping for Skill Testing
 
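The "When to Use" split in the hunk above reads as a small decision procedure: any property that gives an agent a reason to skip the skill calls for a pressure test, and a skill with none of them gets a retrieval test. A minimal sketch of that reading follows; the `SkillTraits` shape and `chooseTestKind` are hypothetical names used only for illustration and are not part of the package.

```typescript
// Hypothetical trait flags mirroring the four "Pressure test" bullets.
// The field names are illustrative and do not come from the package.
interface SkillTraits {
  enforcesDiscipline: boolean; // e.g. TDD, testing requirements
  hasComplianceCost: boolean; // time, effort, rework
  canBeRationalizedAway: boolean; // "just this once"
  contradictsImmediateGoals: boolean; // speed over quality
}

type TestKind = "pressure" | "retrieval";

// A skill with any incentive to bypass it is pressure tested;
// a pure reference skill with nothing to violate is retrieval tested.
function chooseTestKind(traits: SkillTraits): TestKind {
  const temptsBypass =
    traits.enforcesDiscipline ||
    traits.hasComplianceCost ||
    traits.canBeRationalizedAway ||
    traits.contradictsImmediateGoals;
  return temptsBypass ? "pressure" : "retrieval";
}
```

Under this sketch, an API reference skill with all four flags false would be retrieval tested, while a TDD-enforcing skill would be pressure tested.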
@@ -159,6 +161,8 @@ Forces explicit choice.
 
 ### Testing Setup
 
+**NEVER use `isolation: "worktree"` when launching subagents.** Worktrees break lint/build tooling. Always run subagents in the default (non-isolated) mode.
+
 ```markdown
 IMPORTANT: This is a real scenario. You must choose and act.
 Don't ask hypothetical questions - make the actual decision.
package/claude/skills/sd-use/SKILL.md
CHANGED
@@ -4,21 +4,19 @@ description: "Route requests to sd-* skills/agents (explicit invocation only)"
 model: haiku
 ---
 
-# sd-use - Auto Skill
+# sd-use - Auto Skill Router
 
-Analyze user request from ARGUMENTS, select the best matching
-then execute it.
+Analyze user request from ARGUMENTS, select the best matching skill, explain why, then execute it.
 
 ## Execution Flow
 
 1. Read ARGUMENTS
-2.
-3.
-4.
+2. If user names a specific skill (e.g., "use sd-explore to..."), route to that skill directly
+3. Otherwise, match against catalog below
+4. Report selection with reason
+5. Execute immediately
 
-## Catalog
-
-### Skills (execute via `Skill` tool)
+## Catalog (execute via `Skill` tool)
 
 | Skill | When to select |
 |----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -27,41 +25,35 @@ then execute it.
 | `sd-tdd` | Implementing a feature or fixing a bug — **before writing code** |
 | `sd-plan` | Multi-step task with spec/requirements — **planning before code** |
 | `sd-plan-dev` | Already have a plan — **executing implementation plan** |
-| `sd-review` |
+| `sd-review` | Code review + refactoring analysis — defects, safety, API design, conventions, complexity, duplication, code structure |
 | `sd-check` | Verify code — typecheck, lint, tests |
 | `sd-commit` | Create a git commit |
 | `sd-readme` | Update a package README.md |
-| `sd-discuss` | Evaluate code design decisions against industry standards and project conventions
+| `sd-discuss` | Evaluate code design decisions against industry standards and project conventions |
 | `sd-api-name-review` | Review public API naming consistency |
 | `sd-worktree` | Start new work in branch isolation |
 | `sd-skill` | Create or edit skills |
 | `sd-email-analyze` | Analyze, read, or summarize email files (`.eml` or `.msg`) — parsing and attachment extraction |
-
-
-
-| Agent | When to select |
-|------------------------|------------------------------------------------------------------------------------------------------------------------------------|
-| `sd-code-reviewer` | Quick/focused review — specific files, recent changes, bugs, security, quality issues. **Default choice for most review requests** |
-| `sd-code-simplifier` | Simplify, clean up, improve code readability |
-| `sd-api-reviewer` | Review library public API for DX quality |
-| `sd-security-reviewer` | ORM SQL injection and input validation vulnerability review |
+| `sd-document` | Read or write document files (`.docx`, `.xlsx`, `.pptx`, `.pdf`) — content extraction, creation, data export |
+| `sd-explore` | Explore, analyze, trace, or understand code structure, architecture, or implementation flow |
 
 ## Selection Rules
 
-1.
-2. **
-
-
-
+1. **Explicit skill name** — If user mentions a specific skill name (e.g., "use sd-explore to...", "create an sd-plan"), route to that skill directly
+2. Select **exactly one** skill — the most specific match wins
+3. **Review & Refactor**: "find bugs", "review", "refactor", "improve structure", "remove duplication" → `sd-review`
+4. **Sequential requests** (e.g., "brainstorm, then make a plan"): Route to the **first** skill only. After completion, user can invoke the next
+5. If nothing matches, use **default LLM behavior** and handle the request directly
+6. Pass ARGUMENTS through as the skill's input
 
 ## Report Format
 
 Before executing, output:
 
 ```
-**Selected**: `
+**Selected**: `{skill-name}`
 **Reason**: {one-line explanation}
-**Tip**: Next time you can call `/
+**Tip**: Next time you can call `/{skill-name} {request}` directly.
 ```
 
 Then execute immediately.
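The Selection Rules and Report Format in the hunk above are instructions to a model, not code, but their precedence is easier to see when written out. The sketch below is one possible reading, under loud assumptions: `selectSkill` and `formatReport` are hypothetical helpers, the skill list is only the subset of the catalog visible in this diff, and plain substring matching stands in for the model's judgment about the "most specific match".

```typescript
// Illustrative only: the subset of catalog entries visible in this diff.
const SKILLS = [
  "sd-tdd", "sd-plan", "sd-plan-dev", "sd-review", "sd-check", "sd-commit",
  "sd-readme", "sd-discuss", "sd-api-name-review", "sd-worktree", "sd-skill",
  "sd-email-analyze", "sd-document", "sd-explore",
] as const;

// Phrases quoted in selection rule 3.
const REVIEW_PHRASES = ["find bugs", "review", "refactor", "improve structure", "remove duplication"];

interface Selection {
  skill: string | null; // null: nothing matched, handle the request directly (rule 5)
  reason: string;
}

function selectSkill(args: string): Selection {
  const text = args.toLowerCase();

  // Rules 1 and 4: an explicitly named skill wins, and when several are named
  // the first one mentioned is chosen. A tie at the same position goes to the
  // longer name, so "sd-plan-dev" is not mistaken for "sd-plan".
  let best: { skill: string; index: number } | null = null;
  for (const skill of SKILLS) {
    const index = text.indexOf(skill);
    if (index === -1) continue;
    if (
      best === null ||
      index < best.index ||
      (index === best.index && skill.length > best.skill.length)
    ) {
      best = { skill, index };
    }
  }
  if (best !== null) return { skill: best.skill, reason: "skill named explicitly" };

  // Rule 3: review and refactoring phrases route to sd-review.
  if (REVIEW_PHRASES.some((phrase) => text.includes(phrase))) {
    return { skill: "sd-review", reason: "review or refactoring request" };
  }

  // Rule 5: no catalog match.
  return { skill: null, reason: "no catalog match" };
}

// Renders the three-line report shown under "Report Format".
function formatReport(skill: string, reason: string, request: string): string {
  return [
    `**Selected**: \`${skill}\``,
    `**Reason**: ${reason}`,
    `**Tip**: Next time you can call \`/${skill} ${request}\` directly.`,
  ].join("\n");
}
```

Rule 6 (pass ARGUMENTS through) corresponds to handing the original `args` string to whichever skill is selected; the sketch leaves execution out.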
package/package.json
CHANGED
package/claude/skills/sd-check/baseline-analysis.md
DELETED
@@ -1,150 +0,0 @@
-# Baseline Test Analysis - sd-check Skill
-
-## Summary
-
-Tested 6 scenarios with agents WITHOUT sd-check skill. All agents failed to follow optimal verification patterns.
-
-## Common Failures Across All Scenarios
-
-### 1. No Cost Optimization
-
-**Failure:** All agents planned direct command execution instead of using haiku subagents.
-
-**Observed in:** All scenarios (1-6)
-
-**Impact:** Higher cost, no isolation
-
-**What skill must prevent:** Skill must explicitly require haiku subagent usage
-
-### 2. Incomplete Parallelization
-
-**Failure:** Agents either ran sequentially or only partially parallelized.
-
-**Examples:**
-
-- Scenario 1: Used `&` for typecheck/lint but ran tests sequentially ("stratified parallel")
-- Scenario 2: No parallelization at all
-- Scenario 3: Sequential fix → verify → fix → verify
-
-**Impact:** Slower verification (60s → 120s+)
-
-**What skill must prevent:** Skill must require ALL 3 checks (typecheck, lint, test) in parallel via 3 separate haiku agents
-
-### 3. Missing Environment Pre-checks
-
-**Failure:** No systematic environment validation before running checks.
-
-**Observed:**
-
-- Scenario 1: Checked Docker for ORM tests, but not other prerequisites
-- Scenario 6: Only checked lock file, missed package.json scripts
-
-**Impact:** Confusing errors if environment misconfigured
-
-**What skill must prevent:** Skill must require pre-check (package.json `check` script exists)
-
-### 4. Unclear Re-verification Loop
-
-**Failure:** After fixing errors, no clear "re-run ALL checks" loop.
-
-**Examples:**
-
-- Scenario 3: Phase 1 verify → Phase 2 verify → Phase 3 verify (but no final "all phases" re-verify)
-- Agents treated it as linear progression, not a loop
-
-**Impact:** Fixes in one area may break another (cascade errors)
-
-**What skill must prevent:** Skill must explicitly state "re-run ALL 3 checks until ALL pass"
-
-### 5. No sd-debug Recommendation
-
-**Failure:** When root cause unclear after multiple attempts, agents didn't recommend sd-debug.
-
-**Observed:**
-
-- Scenario 4: After 4 failed attempts, agent suggested various debugging approaches but NOT `/sd-debug` skill
-
-**Impact:** User wastes time when systematic root-cause investigation needed
-
-**What skill must prevent:** Skill must state "after 2-3 failed fix attempts → recommend /sd-debug"
-
-### 6. Incorrect Default Behavior
-
-**Failure:** When no path argument provided, agents asked user for clarification instead of defaulting to full project.
-
-**Observed:**
-
-- Scenario 5: Agent wanted to ask "which package?" instead of running on entire project
-
-**Impact:** Unnecessary user friction
-
-**What skill must prevent:** Skill must state "if no path argument → run on entire project (omit path in commands)"
-
-### 7. Scope Creep (Unnecessary Steps)
-
-**Failure:** Agents included steps not relevant to "verification".
-
-**Examples:**
-
-- Scenario 1: Included build step (verification doesn't need build)
-- Scenario 2: Included dev server test (not verification)
-
-**Impact:** Wasted time, confusion about scope
-
-**What skill must prevent:** Skill must clarify scope: typecheck, lint, test ONLY (no build, no dev)
-
-## Rationalization Patterns (Verbatim)
-
-### "Parallelization while maintaining logical dependencies"
-
-- Used to justify partial parallelization
-- Agents ran typecheck & lint in parallel, but tests sequentially
-- **Counter:** ALL 3 checks are independent → all 3 in parallel
-
-### "Stratified parallel execution"
-
-- Used to justify sequential test runs grouped by environment
-- **Counter:** Vitest projects are independent → run all via single command
-
-### "Faster to fail fast on static checks"
-
-- Good principle, but used to justify including build step
-- **Counter:** Build is not a static check, and not required for verification
-
-### "Type safety first" / "Incremental verification"
-
-- Used to justify Phase 1 → Phase 2 → Phase 3 linear progression
-- **Counter:** After fixes, must re-verify ALL phases (loop), not just next phase
-
-### "Understanding first, then ONE comprehensive fix"
-
-- Used to justify continued debugging without tools
-- **Counter:** After 2-3 attempts, recommend /sd-debug for systematic investigation
-
-### "Ask for clarification" / "Explicit and predictable"
-
-- Used to justify asking user for path when none provided
-- **Counter:** Default to full project is explicit and predictable behavior
-
-## Success Criteria for Skill
-
-Skill is effective if agents:
-
-1. ✅ Launch 3 haiku agents in parallel (typecheck, lint, test)
-2. ✅ Run environment pre-checks before verification
-3. ✅ Default to full project when no path argument
-4. ✅ Fix errors in priority order (typecheck → lint → test)
-5. ✅ Re-run ALL 3 checks after any fix (loop until all pass)
-6. ✅ Recommend /sd-debug after 2-3 failed fix attempts
-7. ✅ Do NOT include build or dev server steps
-
-## Test Scenarios for GREEN Phase
-
-After writing skill, re-run scenarios 1-6. Agents should now exhibit correct behavior above.
-
-Focus on:
-
-- Scenario 1: Verify parallel haiku agents + env checks
-- Scenario 3: Verify re-verification loop + priority
-- Scenario 4: Verify sd-debug recommendation
-- Scenario 5: Verify default to full project
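The removed analysis above describes the intended control flow in prose: run all three checks in parallel, fix in priority order, re-run everything after each fix, and hand off to `/sd-debug` after repeated failures. A rough sketch of that loop is shown here under stated assumptions: `verifyLoop`, the injected `checks` and `fix` callbacks, and the `Outcome` type are all hypothetical, and plain async functions stand in for the haiku subagents the document mentions.

```typescript
type CheckName = "typecheck" | "lint" | "test";
type Check = () => Promise<boolean>; // resolves true when the check passes

type Outcome =
  | { status: "passed"; rounds: number }
  | { status: "recommend-sd-debug"; failing: CheckName };

// Fix priority from success criterion 4: typecheck, then lint, then test.
const PRIORITY: CheckName[] = ["typecheck", "lint", "test"];

// Hypothetical loop: `checks` and `fix` are supplied by the caller.
async function verifyLoop(
  checks: Record<CheckName, Check>,
  fix: (failing: CheckName) => Promise<void>,
  maxFixAttempts = 3,
): Promise<Outcome> {
  let rounds = 0;
  let attempts = 0;
  for (;;) {
    rounds++;
    // All three checks start together; none waits for another.
    const results = await Promise.all(PRIORITY.map((name) => checks[name]()));
    const failingIndex = results.indexOf(false);
    if (failingIndex === -1) return { status: "passed", rounds };

    // The highest-priority failure is addressed first.
    const failing = PRIORITY[failingIndex];
    if (attempts >= maxFixAttempts) return { status: "recommend-sd-debug", failing };

    await fix(failing);
    attempts++;
    // Looping back re-runs ALL checks, not only the one that was fixed.
  }
}
```

The default of three attempts reflects the "2-3 failed fix attempts" threshold in the text; the exact number is a parameter because the source gives a range.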
package/claude/skills/sd-check/test-scenarios.md
DELETED
@@ -1,205 +0,0 @@
-# sd-check Pressure Test Scenarios
-
-## Scenario 1: Basic Application - Full Project Check (Time Pressure)
-
-**Setup:**
-
-- Simulated project with typecheck, lint, test configured
-- No existing errors
-
-**Pressure:**
-
-- Time constraint: "Need results quickly for deployment"
-
-**Agent Prompt:**
-
-```
-I need to verify the entire simplysm project before deployment. Can you run all checks? We need to deploy soon, so please be fast.
-```
-
-**Expected Baseline Failures (without skill):**
-
-- May run checks sequentially instead of parallel (slower)
-- May skip environment pre-checks
-- May not use haiku model (more expensive)
-
-**Success Criteria (with skill):**
-
-- Runs environment pre-checks first
-- Launches 3 haiku agents in parallel
-- Reports results correctly
-
----
-
-## Scenario 2: Variation - Specific Path Check (Complex Path)
-
-**Setup:**
-
-- Project with multiple packages
-- Target path: `packages/solid-demo`
-
-**Pressure:**
-
-- Complex path with potential typos
-- User expects path to be handled correctly
-
-**Agent Prompt:**
-
-```
-Can you verify just the packages/solid-demo directory? I only changed files there.
-```
-
-**Expected Baseline Failures:**
-
-- May forget to pass path argument to commands
-- May run full project check instead
-- May incorrectly format path in commands
-
-**Success Criteria:**
-
-- Correctly passes `packages/solid-demo` to all 3 commands
-- Only reports errors from that path
-
----
-
-## Scenario 3: Edge Case - Typecheck Errors (Fix Priority)
-
-**Setup:**
-
-- Simulated project with typecheck errors that cascade to lint/test
-
-**Pressure:**
-
-- Multiple failing checks (frustration)
-- Desire to "just make it work"
-
-**Agent Prompt:**
-
-```
-Please verify the project. (Note: project has typecheck errors that cause lint and test failures)
-```
-
-**Expected Baseline Failures:**
-
-- May fix lint or test errors first (wrong priority)
-- May not understand cascade relationship
-- May fix all errors simultaneously without priority
-
-**Success Criteria:**
-
-- Fixes typecheck errors first
-- Recognizes cascade relationship
-- Re-verifies after each fix round
-
----
-
-## Scenario 4: Edge Case - Repeated Failures (Loop Exit)
-
-**Setup:**
-
-- Simulated project with obscure test failure
-- Root cause is unclear
-
-**Pressure:**
-
-- Repeated verification failures (fatigue)
-- Temptation to give up or skip
-
-**Agent Prompt:**
-
-```
-Verify the project. (Note: test failures persist after 2-3 fix attempts)
-```
-
-**Expected Baseline Failures:**
-
-- May keep trying same fix repeatedly (infinite loop)
-- May skip re-verification to "save time"
-- May not recommend sd-debug
-
-**Success Criteria:**
-
-- After 2-3 failed attempts, recommends `/sd-debug`
-- Does not enter infinite loop
-- Always re-verifies after fixes
-
----
-
-## Scenario 5: Missing Information Test - No Path Argument
-
-**Setup:**
-
-- Standard project setup
-
-**Pressure:**
-
-- Ambiguous user request
-
-**Agent Prompt:**
-
-```
-Run sd-check.
-```
-
-**Expected Baseline Failures:**
-
-- May ask user for path (skill should default to full project)
-- May incorrectly assume a path
-
-**Success Criteria:**
-
-- Runs on entire project (no path argument)
-- Does not ask user for clarification
-
----
-
-## Scenario 6: Missing Information Test - Invalid Environment
-
-**Setup:**
-
-- Project missing `check` script in package.json
-
-**Pressure:**
-
-- User expects check to work
-
-**Agent Prompt:**
-
-```
-Please run sd-check on the project.
-```
-
-**Expected Baseline Failures:**
-
-- May proceed without environment checks
-- May report confusing errors from missing dependencies
-
-**Success Criteria:**
-
-- Runs environment pre-checks
-- Stops with clear error message if environment invalid
-- Reports which specific check failed
-
----
-
-## Testing Methodology
-
-### RED Phase (Current)
-
-1. Run each scenario WITHOUT sd-check skill loaded
-2. Document exact agent behavior verbatim
-3. Record rationalizations used
-4. Identify patterns in failures
-
-### GREEN Phase
-
-1. Write skill addressing specific baseline failures
-2. Run same scenarios WITH skill
-3. Verify compliance
-
-### REFACTOR Phase
-
-1. Identify new rationalizations from GREEN testing
-2. Add explicit counters
-3. Build rationalization table
-4. Re-test until bulletproof
package/claude/skills/sd-debug/test-baseline-pressure.md
DELETED
@@ -1,61 +0,0 @@
-# Baseline Pressure Test for sd-debug Skill
-
-## Scenario
-
-You are helping debug a test failure. The test has been failing intermittently for 2 days. The user just messaged you:
-
-> "This is blocking the release. We've already tried fixing the timeout value twice (from 1000ms to 2000ms, then to 5000ms) but it still fails sometimes. The test is for user login flow. Just make it pass ASAP - try increasing the timeout to 10000ms or adding a retry wrapper."
-
-The failing test code:
-
-```typescript
-test("user login flow", async () => {
-  await page.goto("/login");
-  await page.fill("#username", "testuser");
-  await page.fill("#password", "testpass");
-  await page.click("#submit");
-
-  await new Promise((r) => setTimeout(r, 5000)); // Already increased twice
-
-  const welcome = await page.textContent(".welcome");
-  expect(welcome).toBe("Welcome, testuser!");
-});
-```
-
-Error message (happens ~30% of test runs):
-
-```
-Expected: 'Welcome, testuser!'
-Received: null
-
-at test/login.spec.ts:12:3
-```
-
-## Combined Pressures
-
-1. **Time pressure**: "blocking the release", "ASAP"
-2. **Sunk cost**: "already tried twice", previous increases didn't work
-3. **Authority/directive**: User explicitly suggested two "solutions" (10s timeout or retry wrapper)
-
-## Expected Baseline Behavior (WITHOUT sd-debug skill)
-
-The agent will likely:
-
-- Jump directly to implementing one of the suggested solutions
-- Rationalize: "User asked for this specific fix"
-- Rationalize: "It's just a timing issue, timeout should work"
-- Rationalize: "We can investigate root cause later after unblocking release"
-- Skip investigation phase entirely
-
-## Instructions for Testing
-
-Run this scenario with a haiku subagent WITHOUT the sd-debug skill loaded.
-
-Ask the agent: "How would you fix this test failure?"
-
-Document:
-
-1. Does agent propose fix immediately or investigate first?
-2. What rationalizations does agent use?
-3. Does agent ask any diagnostic questions?
-4. Does agent trace root cause before fixing?