thought-cabinet 0.2.0 → 0.2.1
- package/README.md +29 -7
- package/dist/index.js +88 -79
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
- package/src/agent-assets/skills/creating-plan/SKILL.md +6 -0
- package/src/agent-assets/skills/creating-plan/plan-template.md +37 -12
- package/src/agent-assets/skills/implementing-plan/SKILL.md +30 -3
- package/src/agent-assets/skills/onboard/SKILL.md +74 -0
- package/src/agent-assets/skills/onboard/onboard.sh +118 -0
- package/src/agent-assets/skills/test-skill-e2e/SKILL.md +205 -0
package/package.json
CHANGED
package/src/agent-assets/skills/creating-plan/SKILL.md
@@ -144,6 +144,12 @@ After structure approval:
 
 2. **Write plan** using [plan-template.md](plan-template.md)
 - **MUST** Read the template and follow the structure exactly.
+- **TDD compatibility check**: For every change block, verify:
+- `Testable Behaviors` appears **before** `Reference Implementation`
+- Each testable behavior bullet is specific enough to write a failing test from (includes input, condition, and expected output/behavior)
+- Each bullet maps to exactly one test — split compound behaviors
+- The code block is labeled "Reference Implementation", not "Code to write"
+- If a change block has no conditional logic, no data transformation, and is a pure pass-through, it may omit testable behaviors — document why.
 
 3. **Sync thoughts directory**:
 ```bash
package/src/agent-assets/skills/creating-plan/plan-template.md
@@ -50,12 +50,24 @@
 
 **File**: `path/to/file.ext`
 **Changes**: [Summary of changes]
-
+
+##### Testable Behaviors (RED tests)
+
+> Each bullet is one TDD RED test. `implementing-plan` writes each test first, watches it fail, then writes the minimal code to pass it.
+
+- [Input/condition] → [expected output/behavior]
+- [Edge case] → [expected behavior]
+- [Error case] → [expected fallback]
+
+##### Reference Implementation
 
 ```[language]
-//
+// Suggested implementation — written AFTER the RED tests pass.
+// implementing-plan must not read this before writing the failing tests.
 ```
 
+---
+
 ### Success Criteria:
 
 #### Automated Verification:
@@ -81,18 +93,11 @@
 
 ---
 
-## Testing
-
-### Unit Tests:
-
-- [What to test]
-- [Key edge cases]
+## Integration Testing
 
-
+[End-to-end scenarios that require multiple components working together — not covered by unit tests above]
 
-
-
-### Manual Testing Steps:
+## Manual Testing Steps
 
 1. [Specific verification step]
 2. [Edge case to test manually]
@@ -126,6 +131,26 @@ Always separate into two categories:
 - Performance under real conditions
 - User acceptance criteria
 
+## TDD Compatibility Requirements
+
+When writing each change block, ask:
+
+1. **Are the testable behaviors specific enough to write a failing test from?**
+- Bad: "handles null input"
+- Good: "`envCreateTime=null` with cutoff set → returns `false` (safe fallback)"
+
+2. **Is the behavior written before the code block?**
+- The testable behaviors section must appear before the reference implementation.
+- The implementer reads behaviors first and writes the RED test before reading the code.
+
+3. **Does each bullet map to exactly one test?**
+- Compound behaviors (A and B) → split into two bullets.
+- Each bullet = one `def "..."()` / `it(...)` / `test(...)`.
+
+4. **Is the code block labeled "Reference Implementation"?**
+- Never label it "Code to write" or "Implementation".
+- The label signals it is consulted only after RED → GREEN, not before.
+
 ## Common Patterns
 
 ### Database Changes:
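The ordering rule added to plan-template.md (behaviors before the reference implementation) is simple enough to check mechanically. The following is a minimal sketch, not part of thought-cabinet: `check_tdd_order` is a hypothetical helper, and it assumes the two headings appear exactly as written in the template. Change blocks that legitimately omit testable behaviors (the documented pure pass-through case) would be flagged and need manual review.

```bash
# check_tdd_order FILE
# Hypothetical helper: succeeds when every "##### Reference Implementation"
# heading has a "##### Testable Behaviors" heading before it within the same
# change block; otherwise prints the offending line numbers and fails.
check_tdd_order() {
  awk '
    /^##### Testable Behaviors/ { seen = 1 }
    /^##### Reference Implementation/ {
      if (!seen) {
        printf "line %d: Reference Implementation without preceding Testable Behaviors\n", NR
        bad = 1
      }
      seen = 0   # the next change block must supply its own behaviors section
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

A plan author could run `check_tdd_order path/to/plan.md` before syncing the thoughts directory; the path here is only a placeholder.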
package/src/agent-assets/skills/implementing-plan/SKILL.md
@@ -90,9 +90,36 @@ How should I proceed?
 Before writing any production code for a phase:
 
 1. Read existing test files for the modules being changed (if not already read in Getting Started)
-2.
-3.
-
+2. For each change block in the phase, read only the **Testable Behaviors** section — do NOT read the Reference Implementation yet
+3. For each testable behavior bullet, execute one RED-GREEN-REFACTOR cycle:
+- **RED**: Write one failing test for that behavior. Run it. Confirm it fails for the right reason.
+- **GREEN**: Write the minimal production code to pass it. Run it. Confirm it passes.
+- **REFACTOR**: Clean up. Run tests. Stay green.
+4. Only after all behavior bullets have passing tests, read the Reference Implementation and reconcile — adjust your implementation if it diverges from the plan's intent, but do not delete passing tests.
+5. Proceed to the phase completion checklist.
+
+### How to Extract Work Items from a Plan Change Block
+
+A change block looks like:
+
+```
+##### Testable Behaviors (RED tests)
+- `cutoff empty` → `isEnvCreatedAfterCutoff` returns `true`
+- `createTime after cutoff` → returns `true`
+- `createTime before cutoff` → returns `false`
+- `createTime null + cutoff set` → returns `false` (safe fallback)
+
+##### Reference Implementation
+[code]
+```
+
+Map this to a work queue:
+1. `def "isEnvCreatedAfterCutoff: cutoff empty returns true"()` → RED → GREEN
+2. `def "isEnvCreatedAfterCutoff: createTime after cutoff returns true"()` → RED → GREEN
+3. `def "isEnvCreatedAfterCutoff: createTime before cutoff returns false"()` → RED → GREEN
+4. `def "isEnvCreatedAfterCutoff: null createTime with cutoff set returns false"()` → RED → GREEN
+
+Each bullet is one test. Complete all cycles for this change block before moving to the next.
 
 ## Phase Completion Checklist
 
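The work queue in the implementing-plan hunk uses Spock-style `def "..."()` names. To show the "one bullet, one test" mapping in the same language as the package's scripts, here is a hedged shell rendering. Everything in it is illustrative: `is_env_created_after_cutoff`, the `test_*` functions, and `run_queue` are hypothetical names, timestamps are assumed to be integer epoch seconds, and "after" is assumed to mean strictly greater. It shows the finished state; in the workflow described above, each test would be written and seen to fail before the function body grows.

```bash
# Hypothetical function under test (the plan only names isEnvCreatedAfterCutoff).
is_env_created_after_cutoff() {
  create_time=$1
  cutoff=$2
  [ -z "$cutoff" ] && return 0        # bullet 1: cutoff empty -> true
  [ -z "$create_time" ] && return 1   # bullet 4: null createTime + cutoff set -> false
  [ "$create_time" -gt "$cutoff" ]    # bullets 2 and 3: after -> true, before -> false
}

# One test function per testable-behavior bullet.
test_cutoff_empty_returns_true()             { is_env_created_after_cutoff 100 ""; }
test_create_after_cutoff_returns_true()      { is_env_created_after_cutoff 200 100; }
test_create_before_cutoff_returns_false()    { ! is_env_created_after_cutoff 50 100; }
test_null_create_with_cutoff_returns_false() { ! is_env_created_after_cutoff "" 100; }

# Runs the queue in order and reports each test as GREEN or RED.
run_queue() {
  failed=0
  for t in \
    test_cutoff_empty_returns_true \
    test_create_after_cutoff_returns_true \
    test_create_before_cutoff_returns_false \
    test_null_create_with_cutoff_returns_false
  do
    if "$t"; then echo "GREEN $t"; else echo "RED $t"; failed=1; fi
  done
  return "$failed"
}
```

Calling `run_queue` prints one line per bullet, which makes it easy to see which behavior is still RED.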
package/src/agent-assets/skills/onboard/SKILL.md
@@ -0,0 +1,74 @@
+---
+name: onboard
+description: Onboard an AI agent to a new project by initializing ThoughtCabinet thoughts repo and bootstrapping agent memory. Use when starting work on a new repository, setting up a fresh project for AI-assisted development, or when the user asks to onboard, bootstrap, or initialize a project.
+---
+
+# Onboarding a New Project
+
+Set up a new project for AI-assisted development: initialize the thoughts repo and bootstrap agent memory in one workflow.
+
+## Workflow Context
+
+This skill orchestrates two capabilities that are normally run separately:
+- `thc init` — connects the project to a thoughts repo
+- `init-agent-memory` skill — creates AGENTS.md and supporting docs
+
+After onboarding, the project is ready for skills like `creating-plan`, `research-codebase`, and `implementing-plan`.
+
+## Workflow Overview
+
+1. **Pre-flight + Initialize thoughts** - Run `onboard.sh`: check environment, run `thc init`
+2. **Bootstrap agent memory** - Invoke `init-agent-memory` skill (if needed)
+3. **Verify** - Run `onboard.sh --verify-only`: confirm everything is wired up
+
+## Step 1: Pre-flight and Initialize Thoughts
+
+Run: `bash onboard.sh`
+
+**Exit codes determine next action**:
+- **1** — Fatal error (not a git repo, thc missing, init failed). Stop and report.
+- **2** — Thoughts ready, AGENTS.md not found. Proceed to Step 2.
+- **3** — Thoughts + AGENTS.md both exist. Ask user if they want to regenerate memory or skip to Step 3.
+
+If thoughts was already initialized and user wants to re-initialize:
+```bash
+bash onboard.sh --force
+```
+
+## Step 2: Bootstrap Agent Memory
+
+**If AGENTS.md already exists**: Ask the user whether to regenerate or skip.
+
+**If AGENTS.md does not exist**: Invoke the `init-agent-memory` skill.
+
+**Note**: `thoughts/CLAUDE.md` (from Step 1) and root `CLAUDE.md` (from this step) serve different purposes:
+- `thoughts/CLAUDE.md` — thoughts directory usage rules
+- `./CLAUDE.md` — project memory for the AI agent (symlink to AGENTS.md)
+
+## Step 3: Verify and Present
+
+Run: `bash onboard.sh --verify-only`
+
+Present results to user:
+
+```
+Project onboarding complete!
+
+- thoughts/ connected to [thoughts repo path]
+- AGENTS.md created with project context
+- Git hooks installed (auto-sync on commit)
+
+You're ready to use skills like /creating-plan, /research-codebase, and /implementing-plan.
+```
+
+If any step was skipped or failed, note it clearly with suggested remediation.
+
+## Guidelines
+
+**Be incremental**: Re-running the skill should be safe — each step skips if already done.
+
+**Fail fast**: If a critical step fails, stop and report rather than continuing with a broken setup.
+
+**Minimal prompting**: Only ask questions when the answer cannot be inferred from the environment.
+
+**Respect existing work**: Never overwrite AGENTS.md, CLAUDE.md, or thoughts/ without explicit user confirmation.
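Step 1 of the onboard skill branches on the exit code of `onboard.sh`. The dispatch can be sketched as a small `case` statement. `next_action` and its message strings are hypothetical, chosen only for illustration; the code-to-step mapping follows the three codes listed in Step 1, and any other code is reported as not listed there.

```bash
# next_action CODE
# Hypothetical helper: maps an onboard.sh exit code to the next workflow step
# as described in Step 1 of the skill.
next_action() {
  case "$1" in
    1) echo "stop: fatal error, report to user" ;;
    2) echo "step2: invoke init-agent-memory" ;;
    3) echo "ask: regenerate memory or skip to step 3" ;;
    *) echo "unknown: exit code $1 is not listed in Step 1" ;;
  esac
}

# Example use (not executed here); the "|| code=$?" form keeps a caller that
# runs under "set -e" from aborting on a non-zero exit:
#   code=0; bash onboard.sh || code=$?
#   next_action "$code"
```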
package/src/agent-assets/skills/onboard/onboard.sh
@@ -0,0 +1,118 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# onboard.sh — Pre-flight checks, thoughts init, and verification for the onboard skill.
+#
+# Usage: bash onboard.sh [--force] [--verify-only]
+#   --force         Re-initialize thoughts even if already set up
+#   --verify-only   Skip init, only run verification
+#
+# Exit codes:
+#   0  Success (init completed or already set up)
+#   1  Fatal error (not a git repo, thc missing, init failed)
+#   2  Thoughts already initialized, AGENTS.md not found (agent memory needed)
+#   3  Thoughts already initialized, AGENTS.md exists (fully onboarded)
+
+FORCE=false
+VERIFY_ONLY=false
+for arg in "$@"; do
+  case "$arg" in
+    --force) FORCE=true ;;
+    --verify-only) VERIFY_ONLY=true ;;
+  esac
+done
+
+# --- Helpers ---
+
+resolve_thc() {
+  if command -v thc > /dev/null 2>&1; then
+    echo "thc"
+  elif command -v thoughtcabinet > /dev/null 2>&1; then
+    echo "thoughtcabinet"
+  else
+    echo ""
+  fi
+}
+
+check_status() {
+  [ -L thoughts/shared ] && THOUGHTS=true || THOUGHTS=false
+  [ -f AGENTS.md ] && MEMORY=true || MEMORY=false
+
+  echo "thoughts: $([ "$THOUGHTS" = true ] && echo initialized || echo 'not initialized')"
+  echo "memory: $([ "$MEMORY" = true ] && echo exists || echo 'not found')"
+}
+
+verify() {
+  echo "=== Onboarding Status ==="
+  local issues=0
+
+  if [ -L thoughts/shared ] && [ -L thoughts/global ]; then
+    echo "[OK] thoughts/ initialized"
+  else
+    echo "[FAIL] thoughts/ not initialized"
+    issues=$((issues + 1))
+  fi
+
+  [ -f AGENTS.md ] && echo "[OK] AGENTS.md created" || echo "[SKIP] AGENTS.md not created"
+
+  if [ -L CLAUDE.md ]; then echo "[OK] CLAUDE.md symlink"
+  elif [ -f CLAUDE.md ]; then echo "[OK] CLAUDE.md exists"
+  else echo "[SKIP] CLAUDE.md not created"; fi
+
+  local git_dir
+  git_dir=$(git rev-parse --git-common-dir 2>/dev/null)
+  [ -f "$git_dir/hooks/pre-commit" ] && echo "[OK] pre-commit hook" || echo "[WARN] no pre-commit hook"
+  [ -f "$git_dir/hooks/post-commit" ] && echo "[OK] post-commit hook" || echo "[WARN] no post-commit hook"
+
+  echo "=== Done ==="
+  return "$issues"
+}
+
+# --- Main ---
+
+# Pre-flight: git repo
+if ! git rev-parse --git-dir > /dev/null 2>&1; then
+  echo "FATAL: Not a git repository. Run 'git init' first."
+  exit 1
+fi
+
+# Pre-flight: thc availability
+THC_CMD=$(resolve_thc)
+if [ -z "$THC_CMD" ]; then
+  echo "FATAL: thc is not installed or not in PATH."
+  exit 1
+fi
+
+# Verify-only mode
+if [ "$VERIFY_ONLY" = true ]; then
+  verify
+  exit $?
+fi
+
+# Status check
+check_status
+
+# Initialize thoughts
+if [ "$THOUGHTS" = true ] && [ "$FORCE" = false ]; then
+  echo "SKIP: thoughts/ already initialized."
+  if [ "$MEMORY" = true ]; then
+    exit 3
+  else
+    exit 2
+  fi
+fi
+
+INIT_FLAGS="--directory $(basename "$(pwd)")"
+[ "$FORCE" = true ] && INIT_FLAGS="$INIT_FLAGS --force"
+$THC_CMD init $INIT_FLAGS
+
+# Verify init succeeded
+if [ -L thoughts/shared ] && [ -L thoughts/global ]; then
+  echo "OK: thoughts/ initialized."
+else
+  echo "FATAL: thoughts/ init failed — symlinks not created."
+  exit 1
+fi
+
+# Return based on memory status
+[ -f AGENTS.md ] && exit 3 || exit 2
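One detail of `onboard.sh` worth noting: it builds `INIT_FLAGS` as a single string and expands it unquoted, so a project directory whose name contains spaces would be split into several arguments. A possible alternative, sketched here under assumptions, collects the arguments in the positional parameters instead. `run_init` and `print_args` are hypothetical names, `print_args` stands in for the real `thc init` call (which is not invoked), and only the `--directory` and `--force` flags that appear in the script are used.

```bash
# print_args: stand-in for "$THC_CMD init"; prints each argument on its own
# line in brackets so word boundaries are visible.
print_args() {
  for a in "$@"; do printf '[%s]\n' "$a"; done
}

# run_init DIR FORCE
# Hypothetical wrapper: assembles the init flags without word splitting.
run_init() {
  dir=$1
  force=$2
  set -- --directory "$dir"                    # reset positional params to the flag list
  [ "$force" = true ] && set -- "$@" --force   # append --force only when requested
  print_args "$@"   # in onboard.sh this line would be: "$THC_CMD" init "$@"
}
```

With this shape, `run_init "my project" false` passes the directory name through as a single argument.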
package/src/agent-assets/skills/test-skill-e2e/SKILL.md
@@ -0,0 +1,205 @@
+---
+name: test-skill-e2e
+description: End-to-end smoke test a ThoughtCabinet skill by deploying it to a target agent, running the agent non-interactively against a test project, capturing output, and evaluating results. Use when you want to verify a skill works correctly before shipping.
+---
+
+# End-to-End Skill Testing
+
+Smoke test a ThoughtCabinet skill by deploying it to a target agent CLI, invoking the agent non-interactively against a real project, capturing all output, and evaluating results.
+
+## Workflow Overview
+
+1. **Gather inputs** - Determine which skill, which agent, and which project to test against
+2. **Deploy skills** - Copy all bundled skills into the agent's project-level skill directory
+3. **Prepare test environment** - Clean up artifacts from prior runs
+4. **Execute agent** - Run the agent CLI non-interactively with the skill prompt
+5. **Evaluate results** - Read captured output and generate a pass/fail summary
+
+## Step 1: Gather Inputs
+
+Determine three things from the user or surrounding context:
+
+| Input | Example |
+|-------|---------|
+| Skill to test | `onboard`, or path like `src/agent-assets/skills/onboard/SKILL.md` |
+| Agent CLI | `codex`, `claude` |
+
+Once identified, read the target skill's SKILL.md to understand:
+- What the skill does (description, workflow steps)
+- What artifacts it creates (files, directories, symlinks)
+- What verification checks it performs (look for `[OK]`/`[FAIL]` markers)
+
+```bash
+# Example: read the skill under test
+cat src/agent-assets/skills/<skill-name>/SKILL.md
+```
+
+This understanding is essential for Steps 3 and 5.
+
+## Step 2: Deploy Skills
+
+Copy **all** bundled skills from the ThoughtCabinet source tree into the agent's project-level skill directory. Include all skills (not just the one under test) because skills may reference each other.
+
+Agent skill directory conventions:
+
+| Agent | Directory |
+|-------|-----------|
+| Codex | `<project>/.codex/skills/<skill-name>/` |
+| Claude Code | `<project>/.claude/skills/<skill-name>/` |
+| Cline | `<project>/.cline/skills/<skill-name>/` |
+
+```bash
+# Example: deploy all skills for Codex
+THC_SRC="/path/to/thought-cabinet/src/agent-assets/skills"
+PROJECT="/path/to/test-project"
+AGENT_SKILLS="$PROJECT/.codex/skills"
+
+for skill_dir in "$THC_SRC"/*/; do
+  skill_name=$(basename "$skill_dir")
+  mkdir -p "$AGENT_SKILLS/$skill_name"
+  cp "$skill_dir"SKILL.md "$AGENT_SKILLS/$skill_name/SKILL.md"
+done
+```
+
+Verify deployment:
+
+```bash
+ls -la "$AGENT_SKILLS"
+```
+
+## Step 3: Prepare Test Environment
+
+Clean up artifacts from prior runs so the test starts from a known state. Use the skill's SKILL.md (read in Step 1) to determine what to remove.
+
+Common cleanup actions by skill type:
+
+```bash
+cd "$PROJECT"
+
+# For onboard skill
+thc destroy --force 2>/dev/null || true
+rm -f AGENTS.md
+rm -f CLAUDE.md
+rm -rf docs/architectural-patterns.md
+
+# For init-agent-memory skill
+rm -f AGENTS.md
+rm -f CLAUDE.md
+rm -rf docs/architectural-patterns.md
+
+# Generic: remove thoughts artifacts
+rm -rf thoughts/
+```
+
+Verify the starting state is clean:
+
+```bash
+echo "=== Pre-test state ==="
+[ ! -d thoughts ] && echo "[OK] no thoughts/" || echo "[WARN] thoughts/ still exists"
+[ ! -f AGENTS.md ] && echo "[OK] no AGENTS.md" || echo "[WARN] AGENTS.md still exists"
+[ ! -f CLAUDE.md ] && echo "[OK] no CLAUDE.md" || echo "[WARN] CLAUDE.md still exists"
+```
+
+All checks should show `[OK]` before proceeding.
+
+## Step 4: Execute Agent
+
+Run the agent CLI non-interactively with full permissions, instructing it to invoke the skill. Stream all output to a timestamped log file via `tee`.
+
+### Build the prompt
+
+The prompt should instruct the agent to invoke the target skill:
+
+```
+Invoke the <skill-name> skill to <summary of what the skill does>.
+```
+
+### Agent invocation patterns
+
+| Agent | Command |
+|-------|---------|
+| Codex | `codex exec --dangerously-bypass-approvals-and-sandbox "<prompt>"` |
+| Claude Code | `claude -p "<prompt>" --dangerously-skip-permissions` |
+
+### Run with output capture
+
+```bash
+LOGFILE="test-$(date +%Y%m%d-%H%M%S)-<skill-name>.log"
+
+# Codex example
+codex exec --dangerously-bypass-approvals-and-sandbox \
+  "Invoke the onboard skill to initialize ThoughtCabinet and bootstrap agent memory." \
+  2>&1 | tee "$LOGFILE"
+
+# Claude Code example
+claude -p \
+  "Invoke the onboard skill to initialize ThoughtCabinet and bootstrap agent memory." \
+  --dangerously-skip-permissions \
+  2>&1 | tee "$LOGFILE"
+```
+
+Wait for the agent to complete. Do not interrupt it.
+
+## Step 5: Evaluate Results
+
+Read the log file end-to-end and assess:
+
+```bash
+cat "$LOGFILE"
+```
+
+### Evaluation checklist
+
+| Check | What to look for |
+|-------|-----------------|
+| Skill discovery | Agent found and read the SKILL.md |
+| Step execution | Each numbered workflow step was attempted |
+| No errors | No stack traces, permission blocks, or sandbox violations |
+| Artifact creation | Expected files/directories exist on disk |
+| Verification markers | `[OK]` markers in output; no `[FAIL]` markers |
+
+### Verify artifacts on disk
+
+```bash
+echo "=== Post-test verification ==="
+# Adapt these checks to the skill under test
+
+# For onboard skill
+[ -L thoughts/shared ] && echo "[OK] thoughts/shared symlink" || echo "[FAIL] thoughts/shared missing"
+[ -L thoughts/global ] && echo "[OK] thoughts/global symlink" || echo "[FAIL] thoughts/global missing"
+[ -f AGENTS.md ] && echo "[OK] AGENTS.md exists" || echo "[FAIL] AGENTS.md missing"
+[ -e CLAUDE.md ] && echo "[OK] CLAUDE.md exists" || echo "[FAIL] CLAUDE.md missing"
+```
+
+### Generate summary
+
+Produce a clear pass/fail report:
+
+```
+=== Test Results: <skill-name> ===
+Agent: <agent-name>
+Project: <project-path>
+Log: <logfile-path>
+
+Step 1 (Pre-flight): PASS
+Step 2 (Init thoughts): PASS
+Step 3 (Bootstrap memory): PASS
+Step 4 (Verify): PASS
+
+Overall: PASS (4/4 steps)
+Notes: <any observations>
+```
+
+If any step failed, include the relevant log excerpt and suggested remediation.
+
+## Guidelines
+
+**Idempotent cleanup**: The prepare step (Step 3) should make the test fully repeatable. Running the test twice in a row should produce the same result.
+
+**Log everything**: Always capture output to a file. Agent output in terminals can scroll away or be lost.
+
+**Read the skill first**: Understanding expected outcomes (Step 1) is essential for meaningful evaluation (Step 5). Do not skip this.
+
+**One skill per run**: Test skills individually for clear signal. If you need to test multiple skills, run this workflow once per skill.
+
+**Adapt checks to the skill**: The cleanup and verification examples above are for the `onboard` skill. For other skills, derive the appropriate checks from the skill's SKILL.md.
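The Step 2 deploy loop in test-skill-e2e copies only `SKILL.md` from each skill directory. Since this release also ships companion files next to some skills (`onboard/onboard.sh`, `creating-plan/plan-template.md`), a variant that copies each directory whole may be closer to the stated goal of deploying all bundled skills. The sketch below is one way to do that, not the package's own method; `deploy_skills` is a hypothetical function name.

```bash
# deploy_skills SRC DEST
# Hypothetical variant of the Step 2 loop: copies every skill directory under
# SRC into DEST/<skill-name>/, including companion files next to SKILL.md.
deploy_skills() {
  src=$1
  dest=$2
  for skill_dir in "$src"/*/; do
    [ -d "$skill_dir" ] || continue           # the glob matched nothing
    skill_name=$(basename "$skill_dir")
    mkdir -p "$dest/$skill_name"
    cp -R "$skill_dir". "$dest/$skill_name/"  # trailing "/." copies the directory contents
  done
}
```

Following the Codex example from Step 2, the call would be `deploy_skills "$THC_SRC" "$AGENT_SKILLS"`.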