@newsails/veil-cli 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (199)
  1. package/.veil/agents/analyst/AGENT.md +21 -0
  2. package/.veil/agents/analyst/agent.json +23 -0
  3. package/.veil/agents/assistant/AGENT.md +15 -0
  4. package/.veil/agents/assistant/agent.json +19 -0
  5. package/.veil/agents/coder/AGENT.md +18 -0
  6. package/.veil/agents/coder/agent.json +19 -0
  7. package/.veil/agents/hello/AGENT.md +5 -0
  8. package/.veil/agents/hello/agent.json +13 -0
  9. package/.veil/agents/writer/AGENT.md +12 -0
  10. package/.veil/agents/writer/agent.json +17 -0
  11. package/.veil/memory/MEMORY.md +343 -0
  12. package/.veil/memory/agents/analyst/MEMORY.md +55 -0
  13. package/.veil/memory/agents/hello/MEMORY.md +12 -0
  14. package/.veil/runtime.pid +1 -0
  15. package/.veil/settings.json +10 -0
  16. package/.veil-studio/studio.db +0 -0
  17. package/.veil-studio/studio.db-shm +0 -0
  18. package/.veil-studio/studio.db-wal +0 -0
  19. package/PLAN/01-vision.md +26 -0
  20. package/PLAN/02-tech-stack.md +94 -0
  21. package/PLAN/03-agents.md +232 -0
  22. package/PLAN/04-runtime.md +171 -0
  23. package/PLAN/05-tools.md +211 -0
  24. package/PLAN/06-communication.md +243 -0
  25. package/PLAN/07-storage.md +218 -0
  26. package/PLAN/08-api-cli.md +153 -0
  27. package/PLAN/09-permissions.md +108 -0
  28. package/PLAN/10-ably.md +105 -0
  29. package/PLAN/11-file-formats.md +442 -0
  30. package/PLAN/12-folder-structure.md +205 -0
  31. package/PLAN/13-operations.md +212 -0
  32. package/PLAN/README.md +23 -0
  33. package/README.md +128 -0
  34. package/REPORT.md +174 -0
  35. package/TODO.md +45 -0
  36. package/ai-tests/FRONTEND_PROMPT.md +220 -0
  37. package/ai-tests/Research & Planning.md +814 -0
  38. package/ai-tests/prompt-001-basic-api.md +230 -0
  39. package/ai-tests/prompt-002-basic-flows.md +230 -0
  40. package/ai-tests/prompt-003-agent-behaviors.md +220 -0
  41. package/api/middleware.js +60 -0
  42. package/api/routes/agents.js +193 -0
  43. package/api/routes/chat.js +93 -0
  44. package/api/routes/completions.js +122 -0
  45. package/api/routes/daemons.js +80 -0
  46. package/api/routes/memory.js +169 -0
  47. package/api/routes/models.js +40 -0
  48. package/api/routes/remote-methods.js +74 -0
  49. package/api/routes/sessions.js +208 -0
  50. package/api/routes/settings.js +108 -0
  51. package/api/routes/system.js +50 -0
  52. package/api/routes/tasks.js +270 -0
  53. package/api/server.js +120 -0
  54. package/cli/formatter.js +70 -0
  55. package/cli/index.js +443 -0
  56. package/cli/parser.js +113 -0
  57. package/config/config.json +10 -0
  58. package/config/models.json +6826 -0
  59. package/core/agent.js +329 -0
  60. package/core/cancel.js +38 -0
  61. package/core/compaction.js +176 -0
  62. package/core/events.js +13 -0
  63. package/core/loop.js +564 -0
  64. package/core/memory.js +51 -0
  65. package/core/prompt.js +185 -0
  66. package/core/queue.js +96 -0
  67. package/core/registry.js +291 -0
  68. package/core/remote-methods.js +124 -0
  69. package/core/router.js +386 -0
  70. package/core/running-sessions.js +18 -0
  71. package/docs/api/01-system.md +84 -0
  72. package/docs/api/02-agents.md +374 -0
  73. package/docs/api/03-chat.md +269 -0
  74. package/docs/api/04-tasks.md +470 -0
  75. package/docs/api/05-sessions.md +444 -0
  76. package/docs/api/06-daemons.md +142 -0
  77. package/docs/api/07-memory.md +186 -0
  78. package/docs/api/08-settings.md +133 -0
  79. package/docs/api/09-models.md +119 -0
  80. package/docs/api/09-websocket.md +350 -0
  81. package/docs/api/10-completions.md +134 -0
  82. package/docs/api/README.md +116 -0
  83. package/docs/guide/01-quickstart.md +220 -0
  84. package/docs/guide/02-folder-structure.md +185 -0
  85. package/docs/guide/03-configuration.md +252 -0
  86. package/docs/guide/04-agents.md +267 -0
  87. package/docs/guide/05-cli.md +290 -0
  88. package/docs/guide/06-tools.md +643 -0
  89. package/docs/guide/07-permissions.md +236 -0
  90. package/docs/guide/08-memory.md +139 -0
  91. package/docs/guide/09-multi-agent.md +271 -0
  92. package/docs/guide/10-daemons.md +226 -0
  93. package/docs/guide/README.md +53 -0
  94. package/docs/index.html +623 -0
  95. package/examples/README.md +151 -0
  96. package/examples/agents/assistant/AGENT.md +31 -0
  97. package/examples/agents/assistant/SOUL.md +9 -0
  98. package/examples/agents/assistant/agent.json +74 -0
  99. package/examples/agents/hello/AGENT.md +15 -0
  100. package/examples/agents/hello/agent.json +14 -0
  101. package/examples/agents/monitor/AGENT.md +51 -0
  102. package/examples/agents/monitor/agent.json +33 -0
  103. package/examples/agents/monitor/heartbeats/monitor.md +24 -0
  104. package/examples/agents/orchestrator/AGENT.md +70 -0
  105. package/examples/agents/orchestrator/agent.json +30 -0
  106. package/examples/agents/researcher/AGENT.md +52 -0
  107. package/examples/agents/researcher/agent.json +49 -0
  108. package/examples/agents/researcher/skills/web-research.md +28 -0
  109. package/examples/skills/code-review.md +72 -0
  110. package/examples/skills/summarise.md +59 -0
  111. package/examples/skills/web-research.md +42 -0
  112. package/examples/tools/word-count/index.js +27 -0
  113. package/examples/tools/word-count/tool.json +18 -0
  114. package/infrastructure/database.js +563 -0
  115. package/infrastructure/scheduler.js +122 -0
  116. package/llm/client.js +206 -0
  117. package/migrations/001-initial.sql +121 -0
  118. package/migrations/002-debuggability.sql +13 -0
  119. package/migrations/003-drop-orphaned-columns.sql +72 -0
  120. package/migrations/004-session-message-token-fields.sql +78 -0
  121. package/migrations/005-session-thinking.sql +5 -0
  122. package/package.json +30 -0
  123. package/schemas/agent.json +143 -0
  124. package/schemas/settings.json +111 -0
  125. package/scripts/fetch-models.js +93 -0
  126. package/session-debug-scenario.md +248 -0
  127. package/settings/fields.js +52 -0
  128. package/system-prompts/base-core.md +7 -0
  129. package/system-prompts/environment.md +13 -0
  130. package/system-prompts/reminders/anti-drift.md +6 -0
  131. package/system-prompts/reminders/stall-recovery.md +10 -0
  132. package/system-prompts/safety-rules.md +25 -0
  133. package/system-prompts/task-heuristics.md +27 -0
  134. package/test/client.js +71 -0
  135. package/test/integration/01-health.test.js +25 -0
  136. package/test/integration/02-agents.test.js +80 -0
  137. package/test/integration/03-chat-hello.test.js +48 -0
  138. package/test/integration/04-chat-multiturn.test.js +61 -0
  139. package/test/integration/05-chat-writer.test.js +48 -0
  140. package/test/integration/06-task-basic.test.js +68 -0
  141. package/test/integration/07-task-tools.test.js +74 -0
  142. package/test/integration/08-task-code-analysis.test.js +69 -0
  143. package/test/integration/09-memory-analyst.test.js +63 -0
  144. package/test/integration/10-task-advanced.test.js +85 -0
  145. package/test/integration/11-sessions-advanced.test.js +84 -0
  146. package/test/integration/12-assistant-chat-tools.test.js +75 -0
  147. package/test/integration/13-edge-cases.test.js +99 -0
  148. package/test/integration/14-cancel.test.js +62 -0
  149. package/test/integration/15-debug.test.js +106 -0
  150. package/test/integration/16-memory-api.test.js +83 -0
  151. package/test/integration/17-settings-api.test.js +41 -0
  152. package/test/integration/18-tool-search-activation.test.js +119 -0
  153. package/test/results/.gitkeep +0 -0
  154. package/test/runner.js +206 -0
  155. package/test/smoke.js +216 -0
  156. package/tools/agent_message.js +85 -0
  157. package/tools/agent_send.js +80 -0
  158. package/tools/agent_spawn.js +44 -0
  159. package/tools/bash.js +49 -0
  160. package/tools/edit_file.js +41 -0
  161. package/tools/glob.js +64 -0
  162. package/tools/grep.js +82 -0
  163. package/tools/list_dir.js +63 -0
  164. package/tools/log_write.js +31 -0
  165. package/tools/memory_read.js +38 -0
  166. package/tools/memory_search.js +65 -0
  167. package/tools/memory_write.js +42 -0
  168. package/tools/read_file.js +48 -0
  169. package/tools/sleep.js +22 -0
  170. package/tools/task_create.js +41 -0
  171. package/tools/task_respond.js +37 -0
  172. package/tools/task_spawn.js +64 -0
  173. package/tools/task_status.js +39 -0
  174. package/tools/task_subscribe.js +37 -0
  175. package/tools/todo_read.js +26 -0
  176. package/tools/todo_write.js +38 -0
  177. package/tools/tool_activate.js +24 -0
  178. package/tools/tool_search.js +24 -0
  179. package/tools/web_fetch.js +50 -0
  180. package/tools/web_search.js +52 -0
  181. package/tools/write_file.js +28 -0
  182. package/ui/api.js +190 -0
  183. package/ui/app.js +281 -0
  184. package/ui/index.html +382 -0
  185. package/ui/views/agents.js +377 -0
  186. package/ui/views/chat.js +610 -0
  187. package/ui/views/connection.js +96 -0
  188. package/ui/views/daemons.js +129 -0
  189. package/ui/views/feed.js +194 -0
  190. package/ui/views/memory.js +263 -0
  191. package/ui/views/models.js +146 -0
  192. package/ui/views/sessions.js +314 -0
  193. package/ui/views/settings.js +142 -0
  194. package/ui/views/tasks.js +415 -0
  195. package/utils/context.js +49 -0
  196. package/utils/id.js +16 -0
  197. package/utils/models.js +88 -0
  198. package/utils/paths.js +213 -0
  199. package/utils/settings.js +172 -0
@@ -0,0 +1,814 @@
# VeilCLI Test Engine — Research & Planning Document

*Prepared for handoff to local dev agent with full reasoning context*

---

## 📋 What This Document Is

This document is the output of a full research session between the project owner and an AI research assistant. Its purpose is to give a **local dev agent full context** to implement the VeilCLI Test Engine — without needing to re-ask any questions already answered here.

The dev agent's job: **read the VeilCLI codebase, then use this document to build the test engine.** Every section contains both the *what* and the *why/how we thought of it*.

---
## 🧠 Core Understanding — Read This First

### What the test engine is
A **coded test suite** that lives inside the VeilCLI repo as its own independent runnable package. It is:
- Run via a single entry point with a CLI interface (`node test-engine.js` or similar)
- Able to filter by group (`--group basic-api`, `--group flows`, etc.)
- Self-contained: auto-spins a fresh isolated workspace + VeilCLI server, runs all tests, tears everything down
- Hardcoded: scenarios are written in code — you add new ones by coding them

### Why it exists
The project owner had basic scripts that returned PASS but real-world usage revealed broken behavior. The failures were **not** at the HTTP level — status codes were fine. The failures were in **agent runtime behavior**: tools appearing to succeed but producing wrong/empty results, agent loops breaking silently, memory not persisting correctly, inter-agent communication losing data in transit.

### The fundamental verification philosophy
> **If a script can verify it deterministically → use a script. Only use an AI judge when a script would produce a false positive.**

Examples of when an AI judge is needed vs. not:
- ✅ Script: Did `POST /agents` return 201 with an `id` field? → Script
- ✅ Script: Did the task events show `tool.start` for `read_file`? → Script
- ✅ Script: Does `GET /sessions/:id/messages` return token counts per message? → Script
- ⚠️ AI Judge: Agent A was supposed to call Agent B with a meaningful instruction — the tool fired (HTTP 200, taskId returned), but was the actual message content semantically valid, or empty/garbled? → AI Judge
- ⚠️ AI Judge: Agent was given a task requiring it to read a file and report its contents — did it actually report content from the file, or did it say "I cannot read files"? (We are not testing model hallucination — we are testing that VeilCLI's tool pipeline delivered the file content to the agent correctly.) → AI Judge
- ⚠️ AI Judge: Agent completed a multi-step agentic flow — did it actually complete a coherent start → tool loop → end cycle, or did it stall/give up halfway? → AI Judge

### What is NOT being tested
- LLM response quality / intelligence / accuracy (the model is assumed capable — using Kimi K2.5)
- Model hallucination
- External service reliability (OpenRouter, web search availability)

### AI Judge setup
- Separate external model — NOT the same VeilCLI runtime being tested (avoids circular testing)
- Called via direct API (not through VeilCLI's `/completions`)
- Used sparingly — only where explicitly decided per test
- Returns a structured verdict: PASS / FAIL + reasoning string
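The structured-verdict contract above can be sketched in code. This is a minimal sketch, not the final design: the HTTP call to the external model is elided (any OpenAI-compatible endpoint would do), and the helper names (`buildJudgePrompt`, `parseVerdict`) are illustrative.

```javascript
// Sketch of the judge-side plumbing. The actual model call is elided;
// what matters is producing and parsing a structured PASS/FAIL verdict.
function buildJudgePrompt(criteria, transcript) {
  return [
    'You are a test judge. Evaluate the transcript against the criteria.',
    `CRITERIA: ${criteria}`,
    `TRANSCRIPT:\n${transcript}`,
    'Reply with ONLY a JSON object: {"verdict": "PASS" | "FAIL", "reasoning": "..."}',
  ].join('\n\n');
}

function parseVerdict(modelReply) {
  // Tolerate prose or code fences around the JSON object.
  const match = modelReply.match(/\{[\s\S]*\}/);
  if (!match) return { verdict: 'FAIL', reasoning: 'judge returned no JSON' };
  try {
    const parsed = JSON.parse(match[0]);
    if (parsed.verdict !== 'PASS' && parsed.verdict !== 'FAIL') {
      return { verdict: 'FAIL', reasoning: `invalid verdict: ${parsed.verdict}` };
    }
    return { verdict: parsed.verdict, reasoning: String(parsed.reasoning ?? '') };
  } catch {
    return { verdict: 'FAIL', reasoning: 'judge returned malformed JSON' };
  }
}
```

Defaulting unparseable replies to FAIL keeps the judge fail-closed: a broken judge call can never silently pass a test.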

---

## 🏗️ Engine Architecture

### Project location
Inside the VeilCLI repo, as its own package:
```
VeilCLI/
└── test-engine/
    ├── package.json
    ├── index.js       ← CLI entry point
    ├── runner.js      ← test runner (serial within group, parallel across groups)
    ├── workspace.js   ← workspace + server lifecycle manager
    ├── client.js      ← HTTP test client (thin wrapper around fetch)
    ├── assert.js      ← assertion helpers (standard + deep + AI judge)
    ├── reporter.js    ← console output + failure artifact preservation
    ├── fixtures/      ← reusable agent configs, memory seeds, tool files
    └── tests/
        ├── basic-api/
        ├── flows/
        ├── agent-behaviors/
        ├── tool-coverage/
        ├── debuggability/
        ├── cli/
        └── ai-judged/
```

### Workspace lifecycle
```
START RUN
  → Create fresh temp folder: .veil-test-{timestamp}/
  → Copy auth.json from the project's .veil/auth.json
  → Write settings.json (test-specific: low timeouts, test secret)
  → Start the VeilCLI server (child_process), wait for /health to respond
  → Run all test groups
  → On failure: preserve the entire workspace + HTTP logs + AI judge transcripts
  → On success: delete the workspace
  → Stop the server
END RUN
```

### Verification depth — per test, not global
Each test explicitly defines what it checks:
- **Level 1**: HTTP status + response shape
- **Level 2**: Level 1 + specific field values
- **Level 3**: Level 2 + task events (tool call sequence, parameters)
- **Level 4**: Level 3 + file system / memory state / DB-visible state via API
- **Level 5**: Level 4 + AI judge semantic verification
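One possible shape for test definitions that carries this per-test level and supports `--group` filtering — the names (`defineTest`, `select`, `level`) are illustrative, not a prescribed API:

```javascript
// Registry sketch: each test declares its group and verification level,
// and the runner filters by group before executing.
const tests = [];

function defineTest({ name, group, level, run }) {
  tests.push({ name, group, level, run });
}

function select(group) {
  return group ? tests.filter((t) => t.group === group) : tests;
}

defineTest({
  name: 'agent CRUD',
  group: 'basic-api',
  level: 2, // HTTP status + response shape + field values
  run: async (ctx) => { /* calls ctx.client, asserts on fields */ },
});

defineTest({
  name: 'async task lifecycle',
  group: 'flows',
  level: 3, // + task event inspection
  run: async (ctx) => { /* ... */ },
});
```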

### Console output philosophy
Enough detail to understand what failed **without opening any files**. On failure, show:
- Which test failed
- What was expected vs. what was received
- Which step in the flow failed (not just the final assertion)
- AI judge reasoning if applicable

---

## 📦 Fixtures

**Reasoning**: Tests need agents, memory files, and custom tools to be pre-defined so scenarios are reproducible and readable. A base set is always present in every workspace run; test-specific additions are layered on top.

The dev agent should look at `examples/` in the VeilCLI repo and the `agent.json` schema to understand what valid agent configs look like, then build fixture templates for:

### Base fixtures (always installed)
```
fixtures/
├── agents/
│   ├── basic-chat/        ← minimal agent, chat mode only, no tools
│   ├── task-runner/       ← task mode, has file + memory tools
│   ├── memory-agent/      ← memory enabled, memory tools whitelisted
│   ├── tool-tester/       ← all built-in tools whitelisted
│   ├── restricted-agent/  ← specific tools explicitly denied
│   ├── orchestrator/      ← can spawn subagents, agent_spawn + task_create
│   └── worker/            ← subagent mode, spawned by orchestrator
├── memory/
│   ├── global-seed.md     ← pre-populated global memory for memory tests
│   └── agent-seed.md      ← pre-populated agent memory
├── tools/
│   └── echo-tool.js       ← custom tool that just echoes input (for custom tool loading tests)
└── files/
    ├── sample.txt         ← readable test file for read_file tests
    ├── sample-dir/        ← directory for list_dir / glob tests
    └── grep-target.txt    ← file with known content for grep tests
```

**Dev agent instruction**: Look at the VeilCLI `schemas/` folder for the exact `agent.json` schema fields. Look at `tools/` for the structure of a built-in tool to understand how `echo-tool.js` should be shaped. Look at `examples/` for reference agent definitions.

---

## 🧪 Test Groups & Individual Tests

For each test below, the reasoning explains:
1. **Why this test exists** (what real failure it catches)
2. **How to implement it** (what API calls, what to verify)
3. **Verification level** (script / deep / AI judge)

---

### GROUP 1: Basic API
*Reasoning: Before testing any behavior, validate that the HTTP surface is wired correctly. These tests have nothing to do with AI — they are pure contract tests. The previous basic scripts likely covered some of these, but may have missed field-level validation.*

---
**TEST 1.1 — Agent CRUD**
- **Why**: Core API. If agent creation/read/update/delete is broken, nothing else works.
- **How**:
  1. `POST /agents` with a valid agent config → expect 201, response has a `name` field matching the input
  2. `GET /agents` → expect an array containing the created agent
  3. `GET /agents/:name` → expect the full config returned, verify key fields match what was sent
  4. `PUT /agents/:name` → update a field (e.g. temperature), then GET again and verify it changed
  5. `DELETE /agents/:name` → expect 200/204, then GET returns 404
- **Verification**: Level 2 (HTTP + field values)
- **Dev agent**: Check the exact response shape of each endpoint in `api/routes/` to know which fields to assert on
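These Level 1/2 checks recur in every test in this group, so `assert.js` can reduce them to one-liners. A sketch, with illustrative names (`expectStatus`, `expectFields`):

```javascript
// Two assert.js helpers: status-code check and presence-of-fields check.
function expectStatus(res, expected) {
  if (res.status !== expected) {
    throw new Error(`expected HTTP ${expected}, got ${res.status}`);
  }
}

function expectFields(obj, fields) {
  const missing = fields.filter((f) => obj == null || !(f in obj));
  if (missing.length > 0) {
    throw new Error(`response missing fields: ${missing.join(', ')}`);
  }
}
```

In a test body this would read e.g. `expectStatus(res, 201); expectFields(res.json, ['name']);`.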

---
**TEST 1.2 — Settings CRUD**
- **Why**: Settings affect the whole runtime. If read/write is broken, config changes don't take effect.
- **How**:
  1. `GET /settings` → verify the response has the expected top-level fields and API keys are redacted
  2. `PUT /settings` with a safe field change (e.g. bump `maxIterations`) → expect 200
  3. `GET /settings` again → verify the change is reflected
  4. Test `?level=project` vs `?level=merged` → verify merged includes defaults
- **Verification**: Level 2
- **Dev agent**: Look at the settings loader in `utils/settings.js` to understand the merge layers and which fields are safe to mutate in tests

---
**TEST 1.3 — Health and Status**
- **Why**: Basic liveness. Also `/status` returns counts that other tests can use to validate state changes.
- **How**:
  1. `GET /health` → 200, no DB access required (verify it's fast)
  2. `GET /status` → verify fields: `uptime`, `cwd`, agent count, session count, task count
  3. Create an agent, re-check `/status` → agent count increased
- **Verification**: Level 2

---
**TEST 1.4 — Auth / Secret enforcement**
- **Why**: If the secret is set but not enforced, the security feature is silently broken.
- **How**:
  1. Start the test server with `secret` set in settings
  2. `GET /agents` without the header → expect 401
  3. `GET /agents` with a wrong secret → expect 401
  4. `GET /agents` with the correct `X-Veil-Secret` → expect 200
- **Verification**: Level 1
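The thin `client.js` wrapper can carry the secret for all three cases. A sketch assuming Node 18+ `fetch` and the `X-Veil-Secret` header named above; splitting "build" from "send" keeps the header logic testable without a running server:

```javascript
// Build the request descriptor separately from sending it.
function buildRequest(baseUrl, path, { method = 'GET', body, secret } = {}) {
  const headers = { 'Content-Type': 'application/json' };
  if (secret) headers['X-Veil-Secret'] = secret;
  return {
    url: new URL(path, baseUrl).toString(),
    options: { method, headers, body: body ? JSON.stringify(body) : undefined },
  };
}

// The wrapper tests would actually call.
async function request(baseUrl, path, opts) {
  const { url, options } = buildRequest(baseUrl, path, opts);
  const res = await fetch(url, options);
  const json = await res.json().catch(() => null); // tolerate empty bodies
  return { status: res.status, json };
}
```

The auth test then calls `request` three times: no `secret`, a wrong `secret`, and the correct one.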

---

**TEST 1.5 — Session CRUD**
- **Why**: Sessions are the backbone of chat continuity. Broken session management = broken chat history.
- **How**:
  1. `POST /sessions` pre-create → verify `id` returned
  2. `GET /sessions` → list contains the session
  3. `GET /sessions/:id` → verify fields: `agent`, `mode`, `status`, `message_count`
  4. `POST /sessions/:id/reset` → verify messages cleared (`message_count` = 0 after)
  5. `DELETE /sessions/:id` (soft) → verify status changed
  6. `DELETE /sessions/:id?hard=true` → verify permanent removal
- **Verification**: Level 2-3
- **Dev agent**: Check the session model in the `infrastructure/` SQLite schema to know the exact field names

---
**TEST 1.6 — Models endpoint**
- **Why**: `/models` is used internally for context size limits and cost calculation. If broken, cost tracking fails silently.
- **How**:
  1. `GET /models` → verify the `updated_at` field exists and `models` is a non-empty array
  2. Each model has: `id`, `name`, `context_length`, `pricing`
  3. `GET /models/:provider/:name` → look up a known model, verify fields
- **Verification**: Level 2

---

### GROUP 2: Basic Flows
*Reasoning: The project owner's primary failure — scripts passed but real usage broke. These tests simulate what a real new user would do. They are end-to-end flows, not unit tests. They catch integration failures that unit-level API tests miss.*

---
**TEST 2.1 — New user happy path (chat)**
- **Why**: This is THE most critical regression test. A new user creates one agent and chats. If this breaks, VeilCLI is unusable.
- **How**:
  1. Use the `basic-chat` fixture agent (already installed in the workspace)
  2. `POST /agents/:name/chat` with a simple message → expect 200, `message` field non-empty, `sessionId` returned
  3. Send a follow-up using the returned `sessionId` → verify the session continues (`message_count` increases)
  4. `GET /sessions/:id/messages` → verify both turns exist with correct roles
  5. Verify `input_tokens` and `output_tokens` are non-zero on messages
- **Verification**: Level 3
- **Dev agent**: Check `api/routes/` for the exact chat response shape. Check how `sessionId` is passed in subsequent requests.

---
**TEST 2.2 — Async task full lifecycle**
- **Why**: Task mode is async — the lifecycle (pending → processing → finished) must transition correctly. Previous scripts likely just checked 202 and didn't poll.
- **How**:
  1. `POST /agents/task-runner/task` with a simple input → expect 202, `taskId` returned, status `pending`
  2. Poll `GET /tasks/:id` until status is `finished` or `failed` (with a timeout — e.g. 60s)
  3. Verify the final status is `finished` and `output` is non-empty
  4. `GET /tasks/:id/events` → verify events exist, at minimum one `status.change` event
  5. Verify `token_input` and `token_output` are non-zero on the task record
- **Verification**: Level 3
- **Dev agent**: Check the task polling logic — what's a safe poll interval? Check event types in `api/routes/tasks`.
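The poll-until-terminal step recurs in most task tests, so it is worth one shared helper. A sketch; the interval and timeout defaults are placeholders to tune against the real runtime:

```javascript
// Generic polling helper: call fetchFn until the predicate accepts the
// value, or fail loudly when the deadline passes.
async function pollUntil(fetchFn, isDone, { intervalMs = 500, timeoutMs = 60_000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await fetchFn();
    if (isDone(value)) return value;
    if (Date.now() >= deadline) {
      throw new Error(`pollUntil: timed out after ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// In a task test it would wrap the task endpoint, e.g.:
//   const task = await pollUntil(
//     () => client.get(`/tasks/${taskId}`),
//     (t) => t.status === 'finished' || t.status === 'failed',
//   );
```

Failing with a timeout error (rather than returning the last value) keeps a hung task from masquerading as a pass.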

---

**TEST 2.3 — Chat with SSE streaming**
- **Why**: SSE mode is a separate code path. If broken, streaming clients (the UI) get no response while non-streaming works fine.
- **How**:
  1. `POST /agents/:name/chat` with `{ sse: true }` → expect `text/event-stream` content type
  2. Collect all SSE events until the `done` event is received
  3. Verify `chunk` events were received (content streamed)
  4. Verify the `done` event contains the final message and `tokenUsage`
- **Verification**: Level 2-3
- **Dev agent**: Check how SSE is implemented in the chat route — what events are emitted and in what format.
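The collector's parsing core can be sketched independently of the network. The frame format here (`event:`/`data:` lines, blank-line delimited) is standard `text/event-stream`; the event names `chunk` and `done` come from the test plan above and must be verified against the actual route:

```javascript
// Parse a raw text/event-stream buffer into { event, data } records.
function parseSse(raw) {
  const events = [];
  for (const frame of raw.split('\n\n')) {
    let event = 'message'; // SSE default event name
    const dataLines = [];
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
    }
    if (dataLines.length > 0) events.push({ event, data: dataLines.join('\n') });
  }
  return events;
}
```

The test then asserts at least one `chunk` event and exactly one trailing `done` event.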

---

**TEST 2.4 — Session resumption**
- **Why**: A user closes their client and comes back later. The session must be resumable with history intact.
- **How**:
  1. Chat with an agent, get the `sessionId`
  2. Send a second message with the same `sessionId` — verify context is maintained (the agent references the earlier turn)
  3. Retrieve `GET /sessions/:id/messages` — verify the full history is there
  4. `POST /sessions/:id/reset` — verify messages are cleared but the session still exists
  5. Chat again on the same session — verify it works fresh
- **Verification**: Level 3 (message history) + optionally Level 5 (AI judge: does the agent actually reference the earlier message?)
- **AI Judge note**: Use the AI judge only to confirm context continuity — did the agent's response acknowledge the prior conversation? This is hard to script without semantic understanding.

---

### GROUP 3: Agent Behaviors
*Reasoning: These tests validate that VeilCLI correctly enforces agent configuration — permissions, tool access, mode restrictions. These were identified as a major source of silent failures.*

---
**TEST 3.1 — Tool permission enforcement (deny list)**
- **Why**: If `disallowedTools` doesn't actually block tool usage, security and behavior boundaries are broken. A script check on HTTP alone won't catch this — the agent might try the tool and get an internal error that still returns HTTP 200.
- **How**:
  1. Use the `restricted-agent` fixture (has `bash` in `disallowedTools`)
  2. Give the agent a task that explicitly requires using `bash` ("run the command `echo hello`")
  3. Poll the task to completion
  4. `GET /tasks/:id/events` → verify NO `tool.start` event for `bash` exists
  5. Verify the task either finished (agent worked around it) or failed with an appropriate error — it must NOT have silently run bash
- **Verification**: Level 3 (event log inspection)
- **Dev agent**: Check how `disallowedTools` is enforced in the agentic loop (`core/loop.js`). Understand what event is emitted when a tool is blocked.

---
**TEST 3.2 — Tool whitelist enforcement (allow list)**
- **Why**: Mirror of the above. If the `tools` whitelist doesn't restrict, the agent can use any tool.
- **How**:
  1. Create an agent with only `read_file` in its tools whitelist
  2. Give it a task requiring `write_file`
  3. Verify `write_file` never appears in task events
- **Verification**: Level 3

---
**TEST 3.3 — Mode enforcement**
- **Why**: An agent with `chat.enabled: false` should not be chattable. If mode enforcement is broken, wrong code paths execute.
- **How**:
  1. Create an agent with `modes.chat.enabled: false`
  2. `POST /agents/:name/chat` → expect an appropriate error response (not 200)
  3. Dev agent: check what error code/status VeilCLI returns for disabled modes
- **Verification**: Level 1

---
**TEST 3.4 — maxIterations enforcement**
- **Why**: Without iteration limits, a broken agentic loop runs forever and burns tokens/cost.
- **How**:
  1. Create a task with `maxIterations: 2`
  2. Give the agent a task that would normally require many tool calls
  3. Verify the task stops after 2 iterations — check the `iterations` field on the task record
  4. Verify `onExhausted: "fail"` → task status is `failed`; `onExhausted: "wait"` → task status is `waiting`
- **Verification**: Level 3

---
**TEST 3.5 — Agent reload from disk**
- **Why**: `POST /agents/:name/reload` must actually refresh the config. If it doesn't, live config changes never take effect.
- **How**:
  1. Create an agent, read its config
  2. Directly modify the `agent.json` file on disk (change temperature)
  3. `POST /agents/:name/reload`
  4. `GET /agents/:name` → verify the changed field is now reflected
- **Verification**: Level 2

---
### GROUP 4: Tool Coverage
*Reasoning: There are 24 built-in tools. The previous scripts didn't test tool execution — they tested API endpoints. A tool can return HTTP 200 from the API but fail silently inside the agent loop. Each tool test gives an agent a task that REQUIRES that tool, then verifies via task events that the tool was called AND produced a meaningful result.*

*Pattern for each tool test:*
1. *Create a task that cannot be completed without using the specific tool*
2. *Poll to completion*
3. *Check `GET /tasks/:id/events` for `tool.start` + `tool.end` for that tool*
4. *Check the `tool.end` event result is not an error and has meaningful content*
5. *For some tools, verify side effects (file created, memory updated, etc.)*
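Steps 3 and 4 of the pattern can be one shared helper. A sketch; the event field names (`type`, `tool`, `result`) are assumptions to verify against the real event schema in `api/routes/tasks`:

```javascript
// Given the events array from GET /tasks/:id/events, assert that the named
// tool both started and ended, and that the end result is not empty/an error.
function expectToolRan(events, toolName) {
  const started = events.some((e) => e.type === 'tool.start' && e.tool === toolName);
  if (!started) throw new Error(`no tool.start event for ${toolName}`);
  const end = events.find((e) => e.type === 'tool.end' && e.tool === toolName);
  if (!end) throw new Error(`no tool.end event for ${toolName}`);
  const result = String(end.result ?? '');
  if (result.length === 0 || /^error[:\s]/i.test(result)) {
    throw new Error(`${toolName} ended with an empty/error result: ${result}`);
  }
  return end; // callers can assert on the result content (e.g. a known marker)
}
```

Each tool test then adds its own step-5 side-effect check on top of the returned `tool.end` event.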

---

**TEST 4.1 — File I/O tools**

`read_file`:
- Setup: Write a file to the workspace with known content (e.g. `"TESTMARKER_XYZ"`)
- Task: "Read the file at [path] and tell me what's in it"
- Verify: `tool.end` result contains the file content, NOT an error string
- AI Judge: Was the agent able to report the file content? (Catches the case where the tool pipeline broke and the agent said "I cannot read files")

`write_file`:
- Task: "Write the text 'hello test' to a file called output.txt in [dir]"
- Verify: `tool.end` shows success, then actually check the file exists on disk with the correct content

`edit_file`:
- Setup: Write a file with known content
- Task: "Edit the file at [path], replace 'OLD_TEXT' with 'NEW_TEXT'"
- Verify: The file on disk now contains 'NEW_TEXT', not 'OLD_TEXT'

`list_dir`:
- Setup: Create a directory with 3 known files
- Task: "List the files in [dir] and report the filenames"
- Verify: `tool.end` result contains the expected filenames

`glob`:
- Setup: Create files with `.txt` and `.js` extensions in a dir
- Task: "Find all `.txt` files in [dir]"
- Verify: `tool.end` result matches only the `.txt` files

`grep`:
- Setup: Create a file with a known unique string
- Task: "Search for the pattern 'UNIQUE_GREP_MARKER' in [dir]"
- Verify: `tool.end` result contains the match with a file + line reference

`bash`:
- Task: "Run the command `echo BASH_MARKER_TEST` and report the output"
- Verify: `tool.end` result contains `BASH_MARKER_TEST`
- This is a good deterministic test — the echo output is predictable
---

**TEST 4.2 — Memory tools**

`memory_write` + `memory_read` (same test, sequential):
- Task 1: "Write a note to your memory: 'MEMORY_MARKER_12345'"
- Verify: `memory_write` appears in events, `tool.end` success
- Also verify: `GET /agents/:name/memory/MEMORY.md` via the API → the file contains the written text
- Task 2 (new session, same agent): "Read your memory and tell me what notes you have"
- Verify: `memory_read` in events, AND the agent's response contains `MEMORY_MARKER_12345`
- AI Judge: Did the agent actually retrieve and report from memory, or did it say it has no memory? (Tests persistence across sessions)

`memory_search`:
- Setup: Pre-seed agent memory with several distinct entries
- Task: "Search your memory for [specific topic]"
- Verify: `memory_search` in events, the result is a relevant entry (not empty)

---
**TEST 4.3 — Todo tools**

`todo_write` + `todo_read`:
- Task: "Plan the following 3 tasks as todos: [list]" (the agent should use todo_write naturally)
- Verify: `todo_write` appears in events with a structured `todos` array in its parameters
- Then: "What are your current todos?" (the agent should use todo_read)
- Verify: `todo_read` in events, the response reflects the todos written
- *Note: todo tools are scoped to the current task — verify this scoping works correctly*

---
**TEST 4.4 — Web tools**

`web_search`:
- Task: "Search the web for 'VeilCLI test engine'"
- Verify: `web_search` in events, `tool.end` result is non-empty (not an error)
- *Note: This test may be flaky if DuckDuckGo is rate-limiting — mark it as non-blocking*

`web_fetch`:
- Task: "Fetch the content of https://example.com and tell me the page title"
- Verify: `web_fetch` in events, the result contains HTML-stripped text, not an error

---
**TEST 4.5 — Utility tools**

`sleep`:
- Task: "Wait 3 seconds then say done"
- Verify: `sleep` in events with `seconds: 3`, task took at least 3 seconds

`log_write`:
- Task: "Write a log entry saying 'TEST_LOG_MARKER'"
- Verify: `log_write` in events, then `GET /tasks/:id/events` → a `log` event contains `TEST_LOG_MARKER`

`tool_search`:
- Task: "Search your available tools for memory-related tools"
- Verify: `tool_search` in events, result is non-empty

---

**TEST 4.6 — Multi-agent tools** *(also covered in Group 5, but event-level here)*

`task_create`:
- Task: "Create a new task for the worker agent with input 'say hello'"
- Verify: `task_create` in events, `tool.end` contains a valid `taskId`
- Then: poll that taskId via API → verify it actually exists and runs

`task_status`:
- Setup: Create a task programmatically via API
- Task: "Check the status of task [id]"
- Verify: `task_status` in events, result contains a valid status string

`task_respond`:
- Setup: Create a task with `onExhausted: "wait"`, let it reach the waiting state
- Task: Tell an agent "respond to task [id] with 'continue'"
- Verify: `task_respond` in events, target task resumes

`agent_message`:
- Task: "Send a message to the worker agent asking it to say hello, wait for the reply"
- Verify: `agent_message` in events, `tool.end` result is a non-empty reply
- AI Judge: Was the reply from the worker agent meaningful (not empty/error)?

`agent_send`:
- Task: "Send a fire-and-forget message to the worker agent"
- Verify: `agent_send` in events, `tool.end` success (not expecting a reply)

`agent_spawn (wait=true)`:
- Task: "Spawn the worker agent to [do something], wait for its result, then report it"
- Verify: `agent_spawn` in events with `wait: true`, result is non-empty, task has `parent_task_id` set
- AI Judge: Did the orchestrator's final response actually incorporate the worker's output?

`agent_spawn (wait=false)`:
- Task: "Spawn 2 worker agents in parallel to [do something], collect their taskIds"
- Verify: Two `agent_spawn` events with `wait: false`, two separate taskIds returned

`task_subscribe`:
- Task: Create a task, have the agent subscribe to it
- Verify: `task_subscribe` in events, subscription exists in the DB (check via API behavior after the task completes)

---

**TEST 4.7 — Custom tool loading**
- **Why**: Custom tools in `.veil/agents/<name>/tools/` must be auto-discovered and made available.
- **How**:
  1. Install the `echo-tool.js` fixture into the test agent's tools folder
  2. `GET /agents/:name/skills` → verify the custom tool appears in the list
  3. Give the agent a task that calls the echo tool
  4. Verify `tool.start` + `tool.end` in events for the custom tool name
- **Dev agent**: Look at how custom tools are loaded in `core/agent loader` and what the `schema + execute` export shape must be

---

### GROUP 5: Multi-Agent Flows
*Reasoning: Multi-agent communication is where the most subtle bugs hide. HTTP 200 is meaningless here — the tool can "succeed" but pass empty, truncated, or malformed data between agents. This is the primary group requiring an AI judge.*

---

**TEST 5.1 — Orchestrator spawns worker (sync, wait=true)**
- **Why**: The full sync delegation pattern. If `agent_spawn(wait=true)` passes an empty instruction or loses the result, the orchestrator gets nothing, but no error is raised.
- **How**:
  1. Give the orchestrator a task: "Spawn the worker agent and ask it to [specific task], then report exactly what it said"
  2. Poll the orchestrator task to completion
  3. Check events: `agent_spawn` was called with `wait: true` and a non-empty `instruction`
  4. Check the worker task was created with a `parent_task_id` matching the orchestrator task
  5. Check the orchestrator's output references the worker's result
- **Verification**: Level 3 + AI Judge
- **AI Judge**: "Given this orchestrator output and this worker output, did the orchestrator correctly incorporate the worker's response?"

---

**TEST 5.2 — Parallel fan-out (wait=false)**
- **Why**: Parallel spawning is a complex pattern. Each `agent_spawn(wait=false)` must return a distinct `taskId` immediately, and both child tasks must actually run.
- **How**:
  1. Give the orchestrator: "Spawn 2 worker agents in parallel with different tasks, collect their results"
  2. Verify: Two distinct `agent_spawn` calls in events, each with `wait: false`
  3. Verify: Two child tasks exist in the DB with `parent_task_id` set
  4. Verify: Both child tasks eventually reach `finished`
  5. Verify: Orchestrator output references both results
- **Verification**: Level 3 + AI Judge

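The invariants in steps 2-4 can be checked mechanically once the child task records are fetched. A sketch, assuming tasks expose `id`, `parent_task_id`, and `status` fields (names to be confirmed against the SQLite schema):

```javascript
// Check fan-out invariants on fetched child task records:
// distinct taskIds, parent link set, and every child finished.
function checkFanOut(children, parentTaskId) {
  const ids = new Set(children.map((t) => t.id));
  if (ids.size !== children.length) throw new Error('duplicate child taskIds');
  for (const t of children) {
    if (t.parent_task_id !== parentTaskId) throw new Error(`child ${t.id} missing parent link`);
    if (t.status !== 'finished') throw new Error(`child ${t.id} ended as ${t.status}`);
  }
  return true;
}
```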
---

**TEST 5.3 — Agent messaging (agent_message sync)**
- **Why**: `agent_message` is synchronous — the caller blocks. If the target agent's response is lost or the call hangs, the calling agent stalls indefinitely.
- **How**:
  1. Give the orchestrator: "Message the worker agent and ask it what 2+2 is, report the answer"
  2. Verify: `agent_message` in events, `tool.end` has a non-empty result
  3. Verify: Orchestrator output contains the answer
- **AI Judge**: Was the answer from the worker agent passed through correctly?

---

**TEST 5.4 — Durable task subscription**
- **Why**: `task_subscribe` writes to SQLite and survives a server restart. If this breaks, agents lose notification of subtask completion.
- **How**:
  1. Create a long-running task (use the sleep tool)
  2. Have the subscriber agent call `task_subscribe` on it
  3. Verify the subscription exists (via runtime behavior — when the task completes, the subscriber gets notified)
  4. Wait for the target task to complete
  5. Verify the subscriber agent received the notification (check its session events)
- **Dev agent**: Check the `task_subscriptions` table schema and how notifications are injected into subscriber sessions

---

**TEST 5.5 — maxSubAgentDepth enforcement**
- **Why**: Without depth limits, a rogue agent can spawn infinitely deep chains.
- **How**:
  1. Set `maxSubAgentDepth: 2` in the test settings
  2. Create an orchestrator that spawns a worker that spawns another worker
  3. Verify: The third-level spawn is rejected with an appropriate error
- **Dev agent**: Check what error/event is emitted when the depth limit is exceeded in `core/`

---

### GROUP 6: Debuggability
*Reasoning: The project owner specifically called this out. VeilCLI claims to provide full observability — token counts, cost, tool traces, conversation history. If these are missing or wrong, developers can't debug agent behavior in production. These tests verify that the instrumentation works, not the agent behavior.*

---

**TEST 6.1 — Token tracking per message**
- **Why**: Token counts must exist on every message for cost accountability.
- **How**:
  1. Run a chat conversation (2-3 turns)
  2. `GET /sessions/:id/messages` → every assistant message has `output_tokens > 0`, every user message has `input_tokens > 0`
  3. Verify no message has `null` or `0` tokens (unless it's a system message — dev agent: check whether system messages are in the message list)
- **Verification**: Level 2

---

**TEST 6.2 — Token tracking per session**
- **Why**: Session-level totals must roll up correctly from the message level.
- **How**:
  1. After the conversation above, `GET /sessions/:id`
  2. Verify `token_input` and `token_output` are non-zero
  3. Verify they are >= the sum of individual message tokens (the system prompt adds overhead)
- **Verification**: Level 2

---

**TEST 6.3 — Cost calculation**
- **Why**: Cost must be calculated even when the LLM provider doesn't return it natively (fall back to a pricing model).
- **How**:
  1. After a conversation, `GET /sessions/:id` → verify `cost` is a non-zero number
  2. `GET /sessions/:id/messages` → verify individual message `cost` fields exist
  3. The session cost should approximately equal the sum of message costs
- **Verification**: Level 2

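The rollup relations in TESTS 6.2 and 6.3 reduce to two inequalities plus one tolerance check. A sketch, assuming the field names used above (`token_input`/`token_output` on sessions, `input_tokens`/`output_tokens`/`cost` on messages) and an arbitrary 10% tolerance for "approximately equal":

```javascript
// Session totals must be >= message sums (system prompt adds overhead),
// and session cost should sit close to the message-cost sum.
function checkRollup(session, messages, costTolerance = 0.1) {
  const inSum = messages.reduce((n, m) => n + (m.input_tokens || 0), 0);
  const outSum = messages.reduce((n, m) => n + (m.output_tokens || 0), 0);
  const costSum = messages.reduce((n, m) => n + (m.cost || 0), 0);
  if (session.token_input < inSum) throw new Error('session input tokens below message sum');
  if (session.token_output < outSum) throw new Error('session output tokens below message sum');
  if (Math.abs(session.cost - costSum) > costSum * costTolerance + 1e-9) {
    throw new Error('session cost deviates from message-cost sum');
  }
  return true;
}
```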
---

**TEST 6.4 — Task event trace completeness**
- **Why**: Task events must capture the full agentic loop for debugging. If events are missing, you can't trace what happened.
- **How**:
  1. Run a task that uses at least 2 different tools
  2. `GET /tasks/:id/events` → verify:
     - At least one `status.change` event (pending → processing)
     - One `tool.start` + `tool.end` pair per tool used
     - A final `status.change` to `finished`
  3. Verify `tool.start` events contain the tool name and parameters
  4. Verify `tool.end` events contain a result (not empty)
- **Verification**: Level 3

---

**TEST 6.5 — Context snapshot**
- **Why**: `GET /tasks/:id/context` must return a valid LLM context snapshot for debugging mid-task.
- **How**:
  1. Run a task
  2. `GET /tasks/:id/context` → verify the `messages`, `tools`, and `iteration` fields exist
  3. `messages` array is non-empty and contains at least a system + user turn
  4. `tools` is a non-empty array (tool schemas)
- **Verification**: Level 2

---

**TEST 6.6 — Session message history**
- **Why**: `GET /sessions/:id/messages` is the primary debugging tool for chat. Pagination must work.
- **How**:
  1. Create a chat session with 10+ turns
  2. `GET /sessions/:id/messages?limit=3&offset=0` → verify exactly 3 messages returned
  3. `GET /sessions/:id/messages?limit=3&offset=3` → verify the next 3
  4. Verify the `role` field is either `user`, `assistant`, or `system` on each
- **Verification**: Level 2
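
The window checks in steps 2-4 are pure functions once both pages are fetched. A sketch, assuming messages carry `id` and `role` fields (names unverified against the schema):

```javascript
// Verify two adjacent pagination windows: exact size, no overlap, valid roles.
function checkPages(page1, page2, limit = 3) {
  if (page1.length !== limit || page2.length !== limit) throw new Error('wrong page size');
  const seen = new Set(page1.map((m) => m.id));
  if (page2.some((m) => seen.has(m.id))) throw new Error('pages overlap');
  const roles = new Set(['user', 'assistant', 'system']);
  for (const m of [...page1, ...page2]) {
    if (!roles.has(m.role)) throw new Error(`invalid role: ${m.role}`);
  }
  return true;
}
```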

---

### GROUP 7: CLI Tool
*Reasoning: VeilCLI has a CLI entry point (start, stop, status, agents, login). The CLI is a different code path from the REST API. If the CLI is broken, new users can't even start the server. These tests invoke the actual CLI binary via the shell.*

---

**TEST 7.1 — Server start/stop via CLI**
- **Why**: `veil start` and `veil stop` are the primary user entry points.
- **How**:
  1. Stop the test server (temporarily)
  2. Run `veil start` in the test workspace → wait for it to be ready (`GET /health` responds)
  3. Run `veil status` → verify the output contains running server info
  4. Run `veil stop` → the server goes down (`GET /health` fails)
- **Dev agent**: Check the `cli/` folder for exact command names and expected output format
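
Step 2's "wait for it to be ready" needs a generic readiness poll, and the same helper serves the daemon tests. A sketch; the `veil` command names and `/health` route are taken from this plan and must still be confirmed per the dev-agent note:

```javascript
// Poll an async predicate until it returns true or the timeout elapses.
async function waitFor(predicate, timeoutMs = 30000, intervalMs = 500) {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    if (await predicate()) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}

// Hypothetical usage for TEST 7.1 (commands/flags to be confirmed in `cli/`):
//   execSync('veil start', { cwd: testWorkspace });
//   const up = await waitFor(() => fetch(`${baseUrl}/health`).then((r) => r.ok, () => false));
//   if (!up) throw new Error('server never became healthy');
```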
---

**TEST 7.2 — Agent listing via CLI**
- **How**:
  1. With the server running, run `veil agents` (or equivalent)
  2. Verify the output contains the test agents created in the workspace
- **Dev agent**: Check the CLI command name and output format in `cli/`

---

**TEST 7.3 — CLI chat flow**
- **Why**: A user should be able to start a chat from the CLI, get a response, and resume it.
- **How**:
  1. Run `veil chat <agent-name>` with piped input (or the appropriate flag for non-interactive mode)
  2. Verify the response is printed to stdout
  3. Note the session ID from the output
  4. Re-run with the session flag to resume — verify the agent acknowledges context
- **Dev agent**: Check the `cli/` start command and whether there's a chat subcommand, or how the CLI initiates chat. Look for non-interactive mode flags for testability.

---

### GROUP 8: Daemon Mode
*Reasoning: Daemon mode runs on cron schedules. Tests can't wait for actual cron ticks — use `POST /agents/:name/daemon/trigger` to fire immediately.*

---

**TEST 8.1 — Daemon start/stop/trigger**
- **How**:
  1. Create a daemon agent with a schedule
  2. `POST /agents/:name/daemon/start` → verify it appears in `GET /daemons`
  3. `POST /agents/:name/daemon/trigger` → verify it fires (a new task is created)
  4. Poll that task to completion
  5. `POST /agents/:name/daemon/stop` → verify it is removed from `GET /daemons`
- **Verification**: Level 3

---

**TEST 8.2 — Daemon conflict policy (skip)**
- **How**:
  1. Start the daemon, trigger it
  2. While the first tick is still running, trigger again
  3. With `conflictPolicy: "skip"` → the second tick should be skipped (no second task created)
- **Dev agent**: Check how the conflict policy is implemented in `infrastructure/cron scheduler`

---

**TEST 8.3 — Daemon reads heartbeat file**
- **Why**: Daemon agents read from `.veil/heartbeats/<name>.md` each tick for instructions. If this is broken, daemon behavior can't be configured at runtime.
- **How**:
  1. Write a specific instruction to the heartbeat file
  2. Trigger the daemon
  3. AI Judge: Did the daemon's task output reflect the heartbeat instruction?
- **Verification**: Level 3 + AI Judge

---

### GROUP 9: Memory Persistence
*Reasoning: Memory is a markdown file on disk, injected into system prompts at session start. The key thing to test: does the agent's memory actually persist across sessions and influence behavior?*

---

**TEST 9.1 — Memory write + API read**
- **How**:
  1. The agent writes to memory via the `memory_write` tool in a task
  2. `GET /agents/:name/memory/MEMORY.md` → verify the written content appears
  3. `PUT /agents/:name/memory/:file` via API → write directly
  4. `GET /agents/:name/memory/:file` → verify the roundtrip
- **Verification**: Level 2-3

---

**TEST 9.2 — Memory influences the next session**
- **Why**: Memory is injected at session start. If injection is broken, agents don't have memory even if the file exists.
- **How**:
  1. Pre-seed agent memory with `"MEMORY_SEED_MARKER_ABC"`
  2. Start a new chat session
  3. Ask the agent: "What do you remember about yourself?"
  4. Verify the agent mentions the marker
- **AI Judge**: Did the agent's response demonstrate it had access to the memory content? (This is the canonical AI judge use case — a script can't verify semantic incorporation)
- **Verification**: Level 5 (AI Judge)

---

**TEST 9.3 — Global vs agent memory scoping**
- **How**:
  1. Write to global memory (`PUT /memory/MEMORY.md`)
  2. Write to agent-specific memory (`PUT /agents/:name/memory/MEMORY.md`)
  3. Start a chat with an agent that has memory enabled
  4. Verify both memories are injected (ask the agent about both markers)
- **AI Judge**: Did the agent demonstrate awareness of both global and agent-specific memory?

---

## 🔧 Implementation Notes for Dev Agent

### How to detect tool usage in events
Every agentic tool call produces a `tool.start` and a `tool.end` event in `GET /tasks/:id/events`. The `tool.start` event contains the tool name and parameters; the `tool.end` event contains the result. **This is the primary verification mechanism for tool tests** — don't try to infer tool usage from the agent's text response.
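
A sketch of that mechanism as a reusable helper. The event field names (`type`, `tool`, `parameters`, `result`) are assumptions inferred from the test descriptions in this plan; confirm them against the events route before relying on this.

```javascript
// Pair up tool.start / tool.end events for one tool, in order of occurrence.
function findToolCalls(events, toolName) {
  const starts = events.filter((e) => e.type === 'tool.start' && e.tool === toolName);
  const ends = events.filter((e) => e.type === 'tool.end' && e.tool === toolName);
  return starts.map((s, i) => ({ start: s, end: ends[i] ?? null }));
}

// Typical assertion: at least one call whose result carries the marker.
function assertToolResult(events, toolName, marker) {
  const calls = findToolCalls(events, toolName);
  if (calls.length === 0) throw new Error(`${toolName} was never called`);
  if (!calls.some((c) => c.end && String(c.end.result).includes(marker))) {
    throw new Error(`no ${toolName} result contained ${marker}`);
  }
}
```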
### Polling pattern for async tasks
```javascript
// `client` is the test suite's HTTP helper for the VeilCLI API.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function pollTask(taskId, timeoutMs = 60000) {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    const task = await client.get(`/tasks/${taskId}`);
    if (['finished', 'failed', 'canceled'].includes(task.status)) return task;
    await sleep(2000);
  }
  throw new Error(`Task ${taskId} timed out`);
}
```

### AI Judge invocation pattern
```javascript
// Only use when a script cannot verify semantically.
// `callModel` and `logArtifact` are suite helpers to be written:
// `callModel` hits an external model API directly (not VeilCLI /completions).
async function aiJudge(context, criteria) {
  const prompt =
    `Given this:\n${context}\n\nDid the following hold: ${criteria}?\n` +
    `Reply TEST_PASS or TEST_FAIL followed by one sentence of reasoning.`;
  const reply = await callModel(prompt);
  // Always log the full prompt + response to failure artifacts.
  logArtifact({ type: 'ai-judge', prompt, reply });
  return reply.trim().startsWith('TEST_PASS');
}
```


### On failure: preserve everything
When any assertion fails:
- Keep the entire test workspace folder (all agent files, DB, memory)
- Dump all HTTP requests/responses made during that test
- Dump AI judge prompts and responses
- Print the workspace path to the console so a developer can inspect it directly

### Test agent design
- `basic-chat`: chat mode only, no tools, memory disabled — for pure API tests
- `task-runner`: task mode, tools: `[read_file, write_file, bash, list_dir, glob, grep, sleep, log_write, todo_write, todo_read, tool_search]`
- `memory-agent`: memory enabled, tools: `[memory_read, memory_write, memory_search]`
- `orchestrator`: tools: `[agent_spawn, agent_message, agent_send, task_create, task_status, task_respond, task_subscribe]`, allowedAgents: `[worker]`
- `worker`: subagent mode, tools: `[read_file, write_file, bash, memory_read, memory_write]`
- `restricted-agent`: all tools denied except `read_file`
- `daemon-agent`: daemon mode with a schedule, reads heartbeat

### Dev agent: things to check in the codebase before implementing
1. `api/routes/` — exact request/response shapes for every endpoint (don't guess field names)
2. `core/agentic loop` — how iterations work, how tools are called, how events are emitted
3. `infrastructure/` — SQLite schema (field names on tasks, sessions, messages, events)
4. `cli/` — exact command names and whether a non-interactive mode exists for chat
5. `schemas/agent.json` — exact valid fields for agent configs
6. `tools/` — each tool's schema, to understand what parameters and return values look like
7. `system-prompts/` — what's injected automatically, so you don't duplicate it in test agents

---

## ✅ Summary of All Decisions Made in Research Session

| Decision | Value |
|---|---|
| Engine type | Test suite + validation runtime |
| Location | Inside the VeilCLI repo, own package |
| Entry point | CLI (`node test-engine.js`, group filters) |
| Server lifecycle | Auto-start/stop per run |
| Workspace | Fresh per run, preserved on failure |
| Parallelism | Parallel across groups, serial within a group |
| AI Judge model | External model, NOT VeilCLI `/completions` |
| AI Judge usage | Only when scripts produce false positives |
| AI Judge trigger | Manually decided per test |
| Fixture strategy | Reusable base set + test-specific additions |
| Test authorship | Hardcoded in engine code, add by coding |
| Verification depth | Per-test, explicitly defined |
| LLM calls | Real (no mocking), tolerate variability |
| Reproducibility | 3 identical runs should all pass |
| Output | Console: enough detail to understand a failure without opening files |
| Failure artifacts | Workspace + HTTP logs + AI judge transcripts |
| CLI tests | Via shell exec of the actual `veil` binary |