@newsails/veil-cli 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (199)
  1. package/.veil/agents/analyst/AGENT.md +21 -0
  2. package/.veil/agents/analyst/agent.json +23 -0
  3. package/.veil/agents/assistant/AGENT.md +15 -0
  4. package/.veil/agents/assistant/agent.json +19 -0
  5. package/.veil/agents/coder/AGENT.md +18 -0
  6. package/.veil/agents/coder/agent.json +19 -0
  7. package/.veil/agents/hello/AGENT.md +5 -0
  8. package/.veil/agents/hello/agent.json +13 -0
  9. package/.veil/agents/writer/AGENT.md +12 -0
  10. package/.veil/agents/writer/agent.json +17 -0
  11. package/.veil/memory/MEMORY.md +343 -0
  12. package/.veil/memory/agents/analyst/MEMORY.md +55 -0
  13. package/.veil/memory/agents/hello/MEMORY.md +12 -0
  14. package/.veil/runtime.pid +1 -0
  15. package/.veil/settings.json +10 -0
  16. package/.veil-studio/studio.db +0 -0
  17. package/.veil-studio/studio.db-shm +0 -0
  18. package/.veil-studio/studio.db-wal +0 -0
  19. package/PLAN/01-vision.md +26 -0
  20. package/PLAN/02-tech-stack.md +94 -0
  21. package/PLAN/03-agents.md +232 -0
  22. package/PLAN/04-runtime.md +171 -0
  23. package/PLAN/05-tools.md +211 -0
  24. package/PLAN/06-communication.md +243 -0
  25. package/PLAN/07-storage.md +218 -0
  26. package/PLAN/08-api-cli.md +153 -0
  27. package/PLAN/09-permissions.md +108 -0
  28. package/PLAN/10-ably.md +105 -0
  29. package/PLAN/11-file-formats.md +442 -0
  30. package/PLAN/12-folder-structure.md +205 -0
  31. package/PLAN/13-operations.md +212 -0
  32. package/PLAN/README.md +23 -0
  33. package/README.md +128 -0
  34. package/REPORT.md +174 -0
  35. package/TODO.md +45 -0
  36. package/ai-tests/FRONTEND_PROMPT.md +220 -0
  37. package/ai-tests/Research & Planning.md +814 -0
  38. package/ai-tests/prompt-001-basic-api.md +230 -0
  39. package/ai-tests/prompt-002-basic-flows.md +230 -0
  40. package/ai-tests/prompt-003-agent-behaviors.md +220 -0
  41. package/api/middleware.js +60 -0
  42. package/api/routes/agents.js +193 -0
  43. package/api/routes/chat.js +93 -0
  44. package/api/routes/completions.js +122 -0
  45. package/api/routes/daemons.js +80 -0
  46. package/api/routes/memory.js +169 -0
  47. package/api/routes/models.js +40 -0
  48. package/api/routes/remote-methods.js +74 -0
  49. package/api/routes/sessions.js +208 -0
  50. package/api/routes/settings.js +108 -0
  51. package/api/routes/system.js +50 -0
  52. package/api/routes/tasks.js +270 -0
  53. package/api/server.js +120 -0
  54. package/cli/formatter.js +70 -0
  55. package/cli/index.js +443 -0
  56. package/cli/parser.js +113 -0
  57. package/config/config.json +10 -0
  58. package/config/models.json +6826 -0
  59. package/core/agent.js +329 -0
  60. package/core/cancel.js +38 -0
  61. package/core/compaction.js +176 -0
  62. package/core/events.js +13 -0
  63. package/core/loop.js +564 -0
  64. package/core/memory.js +51 -0
  65. package/core/prompt.js +185 -0
  66. package/core/queue.js +96 -0
  67. package/core/registry.js +291 -0
  68. package/core/remote-methods.js +124 -0
  69. package/core/router.js +386 -0
  70. package/core/running-sessions.js +18 -0
  71. package/docs/api/01-system.md +84 -0
  72. package/docs/api/02-agents.md +374 -0
  73. package/docs/api/03-chat.md +269 -0
  74. package/docs/api/04-tasks.md +470 -0
  75. package/docs/api/05-sessions.md +444 -0
  76. package/docs/api/06-daemons.md +142 -0
  77. package/docs/api/07-memory.md +186 -0
  78. package/docs/api/08-settings.md +133 -0
  79. package/docs/api/09-models.md +119 -0
  80. package/docs/api/09-websocket.md +350 -0
  81. package/docs/api/10-completions.md +134 -0
  82. package/docs/api/README.md +116 -0
  83. package/docs/guide/01-quickstart.md +220 -0
  84. package/docs/guide/02-folder-structure.md +185 -0
  85. package/docs/guide/03-configuration.md +252 -0
  86. package/docs/guide/04-agents.md +267 -0
  87. package/docs/guide/05-cli.md +290 -0
  88. package/docs/guide/06-tools.md +643 -0
  89. package/docs/guide/07-permissions.md +236 -0
  90. package/docs/guide/08-memory.md +139 -0
  91. package/docs/guide/09-multi-agent.md +271 -0
  92. package/docs/guide/10-daemons.md +226 -0
  93. package/docs/guide/README.md +53 -0
  94. package/docs/index.html +623 -0
  95. package/examples/README.md +151 -0
  96. package/examples/agents/assistant/AGENT.md +31 -0
  97. package/examples/agents/assistant/SOUL.md +9 -0
  98. package/examples/agents/assistant/agent.json +74 -0
  99. package/examples/agents/hello/AGENT.md +15 -0
  100. package/examples/agents/hello/agent.json +14 -0
  101. package/examples/agents/monitor/AGENT.md +51 -0
  102. package/examples/agents/monitor/agent.json +33 -0
  103. package/examples/agents/monitor/heartbeats/monitor.md +24 -0
  104. package/examples/agents/orchestrator/AGENT.md +70 -0
  105. package/examples/agents/orchestrator/agent.json +30 -0
  106. package/examples/agents/researcher/AGENT.md +52 -0
  107. package/examples/agents/researcher/agent.json +49 -0
  108. package/examples/agents/researcher/skills/web-research.md +28 -0
  109. package/examples/skills/code-review.md +72 -0
  110. package/examples/skills/summarise.md +59 -0
  111. package/examples/skills/web-research.md +42 -0
  112. package/examples/tools/word-count/index.js +27 -0
  113. package/examples/tools/word-count/tool.json +18 -0
  114. package/infrastructure/database.js +563 -0
  115. package/infrastructure/scheduler.js +122 -0
  116. package/llm/client.js +206 -0
  117. package/migrations/001-initial.sql +121 -0
  118. package/migrations/002-debuggability.sql +13 -0
  119. package/migrations/003-drop-orphaned-columns.sql +72 -0
  120. package/migrations/004-session-message-token-fields.sql +78 -0
  121. package/migrations/005-session-thinking.sql +5 -0
  122. package/package.json +30 -0
  123. package/schemas/agent.json +143 -0
  124. package/schemas/settings.json +111 -0
  125. package/scripts/fetch-models.js +93 -0
  126. package/session-debug-scenario.md +248 -0
  127. package/settings/fields.js +52 -0
  128. package/system-prompts/base-core.md +7 -0
  129. package/system-prompts/environment.md +13 -0
  130. package/system-prompts/reminders/anti-drift.md +6 -0
  131. package/system-prompts/reminders/stall-recovery.md +10 -0
  132. package/system-prompts/safety-rules.md +25 -0
  133. package/system-prompts/task-heuristics.md +27 -0
  134. package/test/client.js +71 -0
  135. package/test/integration/01-health.test.js +25 -0
  136. package/test/integration/02-agents.test.js +80 -0
  137. package/test/integration/03-chat-hello.test.js +48 -0
  138. package/test/integration/04-chat-multiturn.test.js +61 -0
  139. package/test/integration/05-chat-writer.test.js +48 -0
  140. package/test/integration/06-task-basic.test.js +68 -0
  141. package/test/integration/07-task-tools.test.js +74 -0
  142. package/test/integration/08-task-code-analysis.test.js +69 -0
  143. package/test/integration/09-memory-analyst.test.js +63 -0
  144. package/test/integration/10-task-advanced.test.js +85 -0
  145. package/test/integration/11-sessions-advanced.test.js +84 -0
  146. package/test/integration/12-assistant-chat-tools.test.js +75 -0
  147. package/test/integration/13-edge-cases.test.js +99 -0
  148. package/test/integration/14-cancel.test.js +62 -0
  149. package/test/integration/15-debug.test.js +106 -0
  150. package/test/integration/16-memory-api.test.js +83 -0
  151. package/test/integration/17-settings-api.test.js +41 -0
  152. package/test/integration/18-tool-search-activation.test.js +119 -0
  153. package/test/results/.gitkeep +0 -0
  154. package/test/runner.js +206 -0
  155. package/test/smoke.js +216 -0
  156. package/tools/agent_message.js +85 -0
  157. package/tools/agent_send.js +80 -0
  158. package/tools/agent_spawn.js +44 -0
  159. package/tools/bash.js +49 -0
  160. package/tools/edit_file.js +41 -0
  161. package/tools/glob.js +64 -0
  162. package/tools/grep.js +82 -0
  163. package/tools/list_dir.js +63 -0
  164. package/tools/log_write.js +31 -0
  165. package/tools/memory_read.js +38 -0
  166. package/tools/memory_search.js +65 -0
  167. package/tools/memory_write.js +42 -0
  168. package/tools/read_file.js +48 -0
  169. package/tools/sleep.js +22 -0
  170. package/tools/task_create.js +41 -0
  171. package/tools/task_respond.js +37 -0
  172. package/tools/task_spawn.js +64 -0
  173. package/tools/task_status.js +39 -0
  174. package/tools/task_subscribe.js +37 -0
  175. package/tools/todo_read.js +26 -0
  176. package/tools/todo_write.js +38 -0
  177. package/tools/tool_activate.js +24 -0
  178. package/tools/tool_search.js +24 -0
  179. package/tools/web_fetch.js +50 -0
  180. package/tools/web_search.js +52 -0
  181. package/tools/write_file.js +28 -0
  182. package/ui/api.js +190 -0
  183. package/ui/app.js +281 -0
  184. package/ui/index.html +382 -0
  185. package/ui/views/agents.js +377 -0
  186. package/ui/views/chat.js +610 -0
  187. package/ui/views/connection.js +96 -0
  188. package/ui/views/daemons.js +129 -0
  189. package/ui/views/feed.js +194 -0
  190. package/ui/views/memory.js +263 -0
  191. package/ui/views/models.js +146 -0
  192. package/ui/views/sessions.js +314 -0
  193. package/ui/views/settings.js +142 -0
  194. package/ui/views/tasks.js +415 -0
  195. package/utils/context.js +49 -0
  196. package/utils/id.js +16 -0
  197. package/utils/models.js +88 -0
  198. package/utils/paths.js +213 -0
  199. package/utils/settings.js +172 -0
@@ -0,0 +1,814 @@
# VeilCLI Test Engine — Research & Planning Document

*Prepared for handoff to local dev agent with full reasoning context*

---

## 📋 What This Document Is

This document is the output of a full research session between the project owner and an AI research assistant. Its purpose is to give a **local dev agent full context** to implement the VeilCLI Test Engine — without needing to re-ask any questions already answered here.

The dev agent's job: **read the VeilCLI codebase, then use this document to build the test engine.** Every section contains both the *what* and the *why/how we thought of it*.

---
## 🧠 Core Understanding — Read This First

### What the test engine is
A **coded test suite** that lives inside the VeilCLI repo as its own independent runnable package. It is:
- Run via a single entry point with a CLI interface (`node test-engine.js` or similar)
- Able to filter by group (`--group basic-api`, `--group flows`, etc.)
- Self-contained: auto-spins a fresh isolated workspace + VeilCLI server, runs all tests, tears everything down
- Hardcoded: scenarios are written in code — you add new ones by coding them

### Why it exists
The project owner had basic scripts that returned PASS but real-world usage revealed broken behavior. The failures were **not** at the HTTP level — status codes were fine. The failures were in **agent runtime behavior**: tools appearing to succeed but producing wrong/empty results, agent loops breaking silently, memory not persisting correctly, inter-agent communication losing data in transit.

### The fundamental verification philosophy
> **If a script can verify it deterministically → use a script. Only use an AI judge when a script would produce a false positive.**

Examples of when an AI judge is needed vs. not:
- ✅ Script: Did `POST /agents` return 201 with an `id` field? → Script
- ✅ Script: Did the task events show `tool.start` for `read_file`? → Script
- ✅ Script: Does `GET /sessions/:id/messages` return token counts per message? → Script
- ⚠️ AI Judge: Agent A was supposed to call Agent B with a meaningful instruction — the tool fired (HTTP 200, taskId returned), but was the actual message content semantically valid, or empty/garbled? → AI Judge
- ⚠️ AI Judge: Agent was given a task requiring it to read a file and report its contents — did it actually report content from the file, or did it say "I cannot read files"? (We are not testing model hallucination — we are testing that VeilCLI's tool pipeline delivered the file content to the agent correctly.) → AI Judge
- ⚠️ AI Judge: Agent completed a multi-step agentic flow — did it actually complete a coherent start → tool loop → end cycle, or did it stall/give up halfway? → AI Judge

### What is NOT being tested
- LLM response quality / intelligence / accuracy (the model is assumed capable — using Kimi K2.5)
- Model hallucination
- External service reliability (OpenRouter, web search availability)

### AI Judge setup
- Separate external model — NOT the same VeilCLI runtime being tested (avoids circular testing)
- Called via direct API (not through VeilCLI's `/completions`)
- Used sparingly — only where explicitly decided per test
- Returns a structured verdict: PASS / FAIL + reasoning string
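The structured-verdict contract above can be sketched in code. This is a minimal sketch, not the final design: the HTTP call to the external model is elided (any OpenAI-compatible endpoint would do), and the helper names (`buildJudgePrompt`, `parseVerdict`) are illustrative.

```javascript
// Sketch of the judge-side plumbing. The actual model call is elided;
// what matters is producing and parsing a structured PASS/FAIL verdict.
function buildJudgePrompt(criteria, transcript) {
  return [
    'You are a test judge. Evaluate the transcript against the criteria.',
    `CRITERIA: ${criteria}`,
    `TRANSCRIPT:\n${transcript}`,
    'Reply with ONLY a JSON object: {"verdict": "PASS" | "FAIL", "reasoning": "..."}',
  ].join('\n\n');
}

function parseVerdict(modelReply) {
  // Tolerate prose or code fences around the JSON object.
  const match = modelReply.match(/\{[\s\S]*\}/);
  if (!match) return { verdict: 'FAIL', reasoning: 'judge returned no JSON' };
  try {
    const parsed = JSON.parse(match[0]);
    if (parsed.verdict !== 'PASS' && parsed.verdict !== 'FAIL') {
      return { verdict: 'FAIL', reasoning: `invalid verdict: ${parsed.verdict}` };
    }
    return { verdict: parsed.verdict, reasoning: String(parsed.reasoning ?? '') };
  } catch {
    return { verdict: 'FAIL', reasoning: 'judge returned malformed JSON' };
  }
}
```

Defaulting unparseable replies to FAIL keeps the judge fail-closed: a broken judge call can never silently pass a test.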

---

## 🏗️ Engine Architecture

### Project location
Inside the VeilCLI repo, as its own package:
```
VeilCLI/
└── test-engine/
    ├── package.json
    ├── index.js       ← CLI entry point
    ├── runner.js      ← test runner (serial within group, parallel across groups)
    ├── workspace.js   ← workspace + server lifecycle manager
    ├── client.js      ← HTTP test client (thin wrapper around fetch)
    ├── assert.js      ← assertion helpers (standard + deep + AI judge)
    ├── reporter.js    ← console output + failure artifact preservation
    ├── fixtures/      ← reusable agent configs, memory seeds, tool files
    └── tests/
        ├── basic-api/
        ├── flows/
        ├── agent-behaviors/
        ├── tool-coverage/
        ├── debuggability/
        ├── cli/
        └── ai-judged/
```

### Workspace lifecycle
```
START RUN
  → Create fresh temp folder: .veil-test-{timestamp}/
  → Copy auth.json from the project's .veil/auth.json
  → Write settings.json (test-specific: low timeouts, test secret)
  → Start the VeilCLI server (child_process), wait for /health to respond
  → Run all test groups
  → On failure: preserve the entire workspace + HTTP logs + AI judge transcripts
  → On success: delete the workspace
  → Stop the server
END RUN
```

### Verification depth — per test, not global
Each test explicitly defines what it checks:
- **Level 1**: HTTP status + response shape
- **Level 2**: Level 1 + specific field values
- **Level 3**: Level 2 + task events (tool call sequence, parameters)
- **Level 4**: Level 3 + file system / memory state / DB-visible state via API
- **Level 5**: Level 4 + AI judge semantic verification
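One possible shape for test definitions that carries this per-test level and supports `--group` filtering — the names (`defineTest`, `select`, `level`) are illustrative, not a prescribed API:

```javascript
// Registry sketch: each test declares its group and verification level,
// and the runner filters by group before executing.
const tests = [];

function defineTest({ name, group, level, run }) {
  tests.push({ name, group, level, run });
}

function select(group) {
  return group ? tests.filter((t) => t.group === group) : tests;
}

defineTest({
  name: 'agent CRUD',
  group: 'basic-api',
  level: 2, // HTTP status + response shape + field values
  run: async (ctx) => { /* calls ctx.client, asserts on fields */ },
});

defineTest({
  name: 'async task lifecycle',
  group: 'flows',
  level: 3, // + task event inspection
  run: async (ctx) => { /* ... */ },
});
```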

### Console output philosophy
Enough detail to understand what failed **without opening any files**. On failure, show:
- Which test failed
- What was expected vs. what was received
- Which step in the flow failed (not just the final assertion)
- AI judge reasoning if applicable

---

## 📦 Fixtures

**Reasoning**: Tests need agents, memory files, and custom tools to be pre-defined so scenarios are reproducible and readable. A base set is always present in every workspace run; test-specific additions are layered on top.

The dev agent should look at `examples/` in the VeilCLI repo and the `agent.json` schema to understand what valid agent configs look like, then build fixture templates for:

### Base fixtures (always installed)
```
fixtures/
├── agents/
│   ├── basic-chat/        ← minimal agent, chat mode only, no tools
│   ├── task-runner/       ← task mode, has file + memory tools
│   ├── memory-agent/      ← memory enabled, memory tools whitelisted
│   ├── tool-tester/       ← all built-in tools whitelisted
│   ├── restricted-agent/  ← specific tools explicitly denied
│   ├── orchestrator/      ← can spawn subagents, agent_spawn + task_create
│   └── worker/            ← subagent mode, spawned by orchestrator
├── memory/
│   ├── global-seed.md     ← pre-populated global memory for memory tests
│   └── agent-seed.md      ← pre-populated agent memory
├── tools/
│   └── echo-tool.js       ← custom tool that just echoes input (for custom tool loading tests)
└── files/
    ├── sample.txt         ← readable test file for read_file tests
    ├── sample-dir/        ← directory for list_dir / glob tests
    └── grep-target.txt    ← file with known content for grep tests
```

**Dev agent instruction**: Look at the VeilCLI `schemas/` folder for the exact `agent.json` schema fields. Look at `tools/` for the structure of a built-in tool to understand how `echo-tool.js` should be shaped. Look at `examples/` for reference agent definitions.

---

## 🧪 Test Groups & Individual Tests

For each test below, the reasoning explains:
1. **Why this test exists** (what real failure it catches)
2. **How to implement it** (what API calls, what to verify)
3. **Verification level** (script / deep / AI judge)

---

### GROUP 1: Basic API
*Reasoning: Before testing any behavior, validate that the HTTP surface is wired correctly. These tests have nothing to do with AI — they are pure contract tests. The previous basic scripts likely covered some of these, but may have missed field-level validation.*

---
**TEST 1.1 — Agent CRUD**
- **Why**: Core API. If agent creation/read/update/delete is broken, nothing else works.
- **How**:
  1. `POST /agents` with a valid agent config → expect 201, response has a `name` field matching the input
  2. `GET /agents` → expect an array containing the created agent
  3. `GET /agents/:name` → expect the full config returned, verify key fields match what was sent
  4. `PUT /agents/:name` → update a field (e.g. temperature), then GET again and verify it changed
  5. `DELETE /agents/:name` → expect 200/204, then GET returns 404
- **Verification**: Level 2 (HTTP + field values)
- **Dev agent**: Check the exact response shape of each endpoint in `api/routes/` to know which fields to assert on
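These Level 1/2 checks recur in every test in this group, so `assert.js` can reduce them to one-liners. A sketch, with illustrative names (`expectStatus`, `expectFields`):

```javascript
// Two assert.js helpers: status-code check and presence-of-fields check.
function expectStatus(res, expected) {
  if (res.status !== expected) {
    throw new Error(`expected HTTP ${expected}, got ${res.status}`);
  }
}

function expectFields(obj, fields) {
  const missing = fields.filter((f) => obj == null || !(f in obj));
  if (missing.length > 0) {
    throw new Error(`response missing fields: ${missing.join(', ')}`);
  }
}
```

In a test body this would read e.g. `expectStatus(res, 201); expectFields(res.json, ['name']);`.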

---
**TEST 1.2 — Settings CRUD**
- **Why**: Settings affect the whole runtime. If read/write is broken, config changes don't take effect.
- **How**:
  1. `GET /settings` → verify the response has the expected top-level fields and API keys are redacted
  2. `PUT /settings` with a safe field change (e.g. bump `maxIterations`) → expect 200
  3. `GET /settings` again → verify the change is reflected
  4. Test `?level=project` vs `?level=merged` → verify merged includes defaults
- **Verification**: Level 2
- **Dev agent**: Look at the settings loader in `utils/settings.js` to understand the merge layers and which fields are safe to mutate in tests

---
**TEST 1.3 — Health and Status**
- **Why**: Basic liveness. Also `/status` returns counts that other tests can use to validate state changes.
- **How**:
  1. `GET /health` → 200, no DB access required (verify it's fast)
  2. `GET /status` → verify fields: `uptime`, `cwd`, agent count, session count, task count
  3. Create an agent, re-check `/status` → agent count increased
- **Verification**: Level 2

---
**TEST 1.4 — Auth / Secret enforcement**
- **Why**: If the secret is set but not enforced, the security feature is silently broken.
- **How**:
  1. Start the test server with `secret` set in settings
  2. `GET /agents` without the header → expect 401
  3. `GET /agents` with a wrong secret → expect 401
  4. `GET /agents` with the correct `X-Veil-Secret` → expect 200
- **Verification**: Level 1
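The thin `client.js` wrapper can carry the secret for all three cases. A sketch assuming Node 18+ `fetch` and the `X-Veil-Secret` header named above; splitting "build" from "send" keeps the header logic testable without a running server:

```javascript
// Build the request descriptor separately from sending it.
function buildRequest(baseUrl, path, { method = 'GET', body, secret } = {}) {
  const headers = { 'Content-Type': 'application/json' };
  if (secret) headers['X-Veil-Secret'] = secret;
  return {
    url: new URL(path, baseUrl).toString(),
    options: { method, headers, body: body ? JSON.stringify(body) : undefined },
  };
}

// The wrapper tests would actually call.
async function request(baseUrl, path, opts) {
  const { url, options } = buildRequest(baseUrl, path, opts);
  const res = await fetch(url, options);
  const json = await res.json().catch(() => null); // tolerate empty bodies
  return { status: res.status, json };
}
```

The auth test then calls `request` three times: no `secret`, a wrong `secret`, and the correct one.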

---

**TEST 1.5 — Session CRUD**
- **Why**: Sessions are the backbone of chat continuity. Broken session management = broken chat history.
- **How**:
  1. `POST /sessions` pre-create → verify `id` returned
  2. `GET /sessions` → list contains the session
  3. `GET /sessions/:id` → verify fields: `agent`, `mode`, `status`, `message_count`
  4. `POST /sessions/:id/reset` → verify messages cleared (`message_count` = 0 after)
  5. `DELETE /sessions/:id` (soft) → verify status changed
  6. `DELETE /sessions/:id?hard=true` → verify permanent removal
- **Verification**: Level 2-3
- **Dev agent**: Check the session model in the `infrastructure/` SQLite schema to know the exact field names

---
**TEST 1.6 — Models endpoint**
- **Why**: `/models` is used internally for context size limits and cost calculation. If broken, cost tracking fails silently.
- **How**:
  1. `GET /models` → verify the `updated_at` field exists and `models` is a non-empty array
  2. Each model has: `id`, `name`, `context_length`, `pricing`
  3. `GET /models/:provider/:name` → look up a known model, verify fields
- **Verification**: Level 2

---

### GROUP 2: Basic Flows
*Reasoning: The project owner's primary failure — scripts passed but real usage broke. These tests simulate what a real new user would do. They are end-to-end flows, not unit tests. They catch integration failures that unit-level API tests miss.*

---
**TEST 2.1 — New user happy path (chat)**
- **Why**: This is THE most critical regression test. A new user creates one agent and chats. If this breaks, VeilCLI is unusable.
- **How**:
  1. Use the `basic-chat` fixture agent (already installed in the workspace)
  2. `POST /agents/:name/chat` with a simple message → expect 200, `message` field non-empty, `sessionId` returned
  3. Send a follow-up using the returned `sessionId` → verify the session continues (`message_count` increases)
  4. `GET /sessions/:id/messages` → verify both turns exist with correct roles
  5. Verify `input_tokens` and `output_tokens` are non-zero on messages
- **Verification**: Level 3
- **Dev agent**: Check `api/routes/` for the exact chat response shape. Check how `sessionId` is passed in subsequent requests.

---
**TEST 2.2 — Async task full lifecycle**
- **Why**: Task mode is async — the lifecycle (pending → processing → finished) must transition correctly. Previous scripts likely just checked 202 and didn't poll.
- **How**:
  1. `POST /agents/task-runner/task` with a simple input → expect 202, `taskId` returned, status `pending`
  2. Poll `GET /tasks/:id` until status is `finished` or `failed` (with a timeout — e.g. 60s)
  3. Verify the final status is `finished` and `output` is non-empty
  4. `GET /tasks/:id/events` → verify events exist, at minimum one `status.change` event
  5. Verify `token_input` and `token_output` are non-zero on the task record
- **Verification**: Level 3
- **Dev agent**: Check the task polling logic — what's a safe poll interval? Check event types in `api/routes/tasks`.
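The poll-until-terminal step recurs in most task tests, so it is worth one shared helper. A sketch; the interval and timeout defaults are placeholders to tune against the real runtime:

```javascript
// Generic polling helper: call fetchFn until the predicate accepts the
// value, or fail loudly when the deadline passes.
async function pollUntil(fetchFn, isDone, { intervalMs = 500, timeoutMs = 60_000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await fetchFn();
    if (isDone(value)) return value;
    if (Date.now() >= deadline) {
      throw new Error(`pollUntil: timed out after ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// In a task test it would wrap the task endpoint, e.g.:
//   const task = await pollUntil(
//     () => client.get(`/tasks/${taskId}`),
//     (t) => t.status === 'finished' || t.status === 'failed',
//   );
```

Failing with a timeout error (rather than returning the last value) keeps a hung task from masquerading as a pass.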

---

**TEST 2.3 — Chat with SSE streaming**
- **Why**: SSE mode is a separate code path. If broken, streaming clients (the UI) get no response while non-streaming works fine.
- **How**:
  1. `POST /agents/:name/chat` with `{ sse: true }` → expect `text/event-stream` content type
  2. Collect all SSE events until the `done` event is received
  3. Verify `chunk` events were received (content streamed)
  4. Verify the `done` event contains the final message and `tokenUsage`
- **Verification**: Level 2-3
- **Dev agent**: Check how SSE is implemented in the chat route — what events are emitted and in what format.
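The collector's parsing core can be sketched independently of the network. The frame format here (`event:`/`data:` lines, blank-line delimited) is standard `text/event-stream`; the event names `chunk` and `done` come from the test plan above and must be verified against the actual route:

```javascript
// Parse a raw text/event-stream buffer into { event, data } records.
function parseSse(raw) {
  const events = [];
  for (const frame of raw.split('\n\n')) {
    let event = 'message'; // SSE default event name
    const dataLines = [];
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
    }
    if (dataLines.length > 0) events.push({ event, data: dataLines.join('\n') });
  }
  return events;
}
```

The test then asserts at least one `chunk` event and exactly one trailing `done` event.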

---

**TEST 2.4 — Session resumption**
- **Why**: A user closes their client and comes back later. The session must be resumable with history intact.
- **How**:
  1. Chat with an agent, get the `sessionId`
  2. Send a second message with the same `sessionId` — verify context is maintained (the agent references the earlier turn)
  3. Retrieve `GET /sessions/:id/messages` — verify the full history is there
  4. `POST /sessions/:id/reset` — verify messages are cleared but the session still exists
  5. Chat again on the same session — verify it works fresh
- **Verification**: Level 3 (message history) + optionally Level 5 (AI judge: does the agent actually reference the earlier message?)
- **AI Judge note**: Use the AI judge only to confirm context continuity — did the agent's response acknowledge the prior conversation? This is hard to script without semantic understanding.

---

### GROUP 3: Agent Behaviors
*Reasoning: These tests validate that VeilCLI correctly enforces agent configuration — permissions, tool access, mode restrictions. These were identified as a major source of silent failures.*

---
**TEST 3.1 — Tool permission enforcement (deny list)**
- **Why**: If `disallowedTools` doesn't actually block tool usage, security and behavior boundaries are broken. A script check on HTTP alone won't catch this — the agent might try the tool and get an internal error that still returns HTTP 200.
- **How**:
  1. Use the `restricted-agent` fixture (has `bash` in `disallowedTools`)
  2. Give the agent a task that explicitly requires using `bash` ("run the command `echo hello`")
  3. Poll the task to completion
  4. `GET /tasks/:id/events` → verify NO `tool.start` event for `bash` exists
  5. Verify the task either finished (agent worked around it) or failed with an appropriate error — it must NOT have silently run bash
- **Verification**: Level 3 (event log inspection)
- **Dev agent**: Check how `disallowedTools` is enforced in the agentic loop (`core/loop.js`). Understand what event is emitted when a tool is blocked.

---
**TEST 3.2 — Tool whitelist enforcement (allow list)**
- **Why**: Mirror of the above. If the `tools` whitelist doesn't restrict, the agent can use any tool.
- **How**:
  1. Create an agent with only `read_file` in its tools whitelist
  2. Give it a task requiring `write_file`
  3. Verify `write_file` never appears in task events
- **Verification**: Level 3

---
**TEST 3.3 — Mode enforcement**
- **Why**: An agent with `chat.enabled: false` should not be chattable. If mode enforcement is broken, wrong code paths execute.
- **How**:
  1. Create an agent with `modes.chat.enabled: false`
  2. `POST /agents/:name/chat` → expect an appropriate error response (not 200)
  3. Dev agent: check what error code/status VeilCLI returns for disabled modes
- **Verification**: Level 1

---
**TEST 3.4 — maxIterations enforcement**
- **Why**: Without iteration limits, a broken agentic loop runs forever and burns tokens/cost.
- **How**:
  1. Create a task with `maxIterations: 2`
  2. Give the agent a task that would normally require many tool calls
  3. Verify the task stops after 2 iterations — check the `iterations` field on the task record
  4. Verify `onExhausted: "fail"` → task status is `failed`; `onExhausted: "wait"` → task status is `waiting`
- **Verification**: Level 3

---
**TEST 3.5 — Agent reload from disk**
- **Why**: `POST /agents/:name/reload` must actually refresh the config. If it doesn't, live config changes never take effect.
- **How**:
  1. Create an agent, read its config
  2. Directly modify the `agent.json` file on disk (change temperature)
  3. `POST /agents/:name/reload`
  4. `GET /agents/:name` → verify the changed field is now reflected
- **Verification**: Level 2

---
### GROUP 4: Tool Coverage
*Reasoning: There are 24 built-in tools. The previous scripts didn't test tool execution — they tested API endpoints. A tool can return HTTP 200 from the API but fail silently inside the agent loop. Each tool test gives an agent a task that REQUIRES that tool, then verifies via task events that the tool was called AND produced a meaningful result.*

*Pattern for each tool test:*
1. *Create a task that cannot be completed without using the specific tool*
2. *Poll to completion*
3. *Check `GET /tasks/:id/events` for `tool.start` + `tool.end` for that tool*
4. *Check the `tool.end` event result is not an error and has meaningful content*
5. *For some tools, verify side effects (file created, memory updated, etc.)*
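Steps 3 and 4 of the pattern can be one shared helper. A sketch; the event field names (`type`, `tool`, `result`) are assumptions to verify against the real event schema in `api/routes/tasks`:

```javascript
// Given the events array from GET /tasks/:id/events, assert that the named
// tool both started and ended, and that the end result is not empty/an error.
function expectToolRan(events, toolName) {
  const started = events.some((e) => e.type === 'tool.start' && e.tool === toolName);
  if (!started) throw new Error(`no tool.start event for ${toolName}`);
  const end = events.find((e) => e.type === 'tool.end' && e.tool === toolName);
  if (!end) throw new Error(`no tool.end event for ${toolName}`);
  const result = String(end.result ?? '');
  if (result.length === 0 || /^error[:\s]/i.test(result)) {
    throw new Error(`${toolName} ended with an empty/error result: ${result}`);
  }
  return end; // callers can assert on the result content (e.g. a known marker)
}
```

Each tool test then adds its own step-5 side-effect check on top of the returned `tool.end` event.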

---

**TEST 4.1 — File I/O tools**

`read_file`:
- Setup: Write a file to the workspace with known content (e.g. `"TESTMARKER_XYZ"`)
- Task: "Read the file at [path] and tell me what's in it"
- Verify: `tool.end` result contains the file content, NOT an error string
- AI Judge: Was the agent able to report the file content? (Catches the case where the tool pipeline broke and the agent said "I cannot read files")

`write_file`:
- Task: "Write the text 'hello test' to a file called output.txt in [dir]"
- Verify: `tool.end` shows success, then actually check the file exists on disk with the correct content

`edit_file`:
- Setup: Write a file with known content
- Task: "Edit the file at [path], replace 'OLD_TEXT' with 'NEW_TEXT'"
- Verify: The file on disk now contains 'NEW_TEXT', not 'OLD_TEXT'

`list_dir`:
- Setup: Create a directory with 3 known files
- Task: "List the files in [dir] and report the filenames"
- Verify: `tool.end` result contains the expected filenames

`glob`:
- Setup: Create files with `.txt` and `.js` extensions in a dir
- Task: "Find all `.txt` files in [dir]"
- Verify: `tool.end` result matches only the `.txt` files

`grep`:
- Setup: Create a file with a known unique string
- Task: "Search for the pattern 'UNIQUE_GREP_MARKER' in [dir]"
- Verify: `tool.end` result contains the match with a file + line reference

`bash`:
- Task: "Run the command `echo BASH_MARKER_TEST` and report the output"
- Verify: `tool.end` result contains `BASH_MARKER_TEST`
- This is a good deterministic test — the echo output is predictable
---

**TEST 4.2 — Memory tools**

`memory_write` + `memory_read` (same test, sequential):
- Task 1: "Write a note to your memory: 'MEMORY_MARKER_12345'"
- Verify: `memory_write` appears in events, `tool.end` success
- Also verify: `GET /agents/:name/memory/MEMORY.md` via the API → the file contains the written text
- Task 2 (new session, same agent): "Read your memory and tell me what notes you have"
- Verify: `memory_read` in events, AND the agent's response contains `MEMORY_MARKER_12345`
- AI Judge: Did the agent actually retrieve and report from memory, or did it say it has no memory? (Tests persistence across sessions)

`memory_search`:
- Setup: Pre-seed agent memory with several distinct entries
- Task: "Search your memory for [specific topic]"
- Verify: `memory_search` in events, the result is a relevant entry (not empty)

---
**TEST 4.3 — Todo tools**

`todo_write` + `todo_read`:
- Task: "Plan the following 3 tasks as todos: [list]" (the agent should use todo_write naturally)
- Verify: `todo_write` appears in events with a structured `todos` array in its parameters
- Then: "What are your current todos?" (the agent should use todo_read)
- Verify: `todo_read` in events, the response reflects the todos written
- *Note: todo tools are scoped to the current task — verify this scoping works correctly*

---
**TEST 4.4 — Web tools**

`web_search`:
- Task: "Search the web for 'VeilCLI test engine'"
- Verify: `web_search` in events, `tool.end` result is non-empty (not an error)
- *Note: This test may be flaky if DuckDuckGo is rate-limiting — mark it as non-blocking*

`web_fetch`:
- Task: "Fetch the content of https://example.com and tell me the page title"
- Verify: `web_fetch` in events, the result contains HTML-stripped text, not an error

---
**TEST 4.5 — Utility tools**

`sleep`:
- Task: "Wait 3 seconds then say done"
- Verify: `sleep` in events with `seconds: 3`, task took at least 3 seconds

`log_write`:
- Task: "Write a log entry saying 'TEST_LOG_MARKER'"
- Verify: `log_write` in events, then `GET /tasks/:id/events` → a `log` event contains `TEST_LOG_MARKER`

`tool_search`:
- Task: "Search your available tools for memory-related tools"
- Verify: `tool_search` in events, result is non-empty

---

**TEST 4.6 — Multi-agent tools** *(also covered in Group 5, but event-level here)*

`task_create`:
- Task: "Create a new task for the worker agent with input 'say hello'"
- Verify: `task_create` in events, `tool.end` contains a valid `taskId`
- Then: poll that taskId via API → verify it actually exists and runs

`task_status`:
- Setup: Create a task programmatically via API
- Task: "Check the status of task [id]"
- Verify: `task_status` in events, result contains a valid status string

`task_respond`:
- Setup: Create a task with `onExhausted: "wait"`, let it reach the waiting state
- Task: Tell an agent "respond to task [id] with 'continue'"
- Verify: `task_respond` in events, target task resumes

`agent_message`:
- Task: "Send a message to the worker agent asking it to say hello, wait for the reply"
- Verify: `agent_message` in events, `tool.end` result is a non-empty reply
- AI Judge: Was the reply from the worker agent meaningful (not empty/error)?

`agent_send`:
- Task: "Send a fire-and-forget message to the worker agent"
- Verify: `agent_send` in events, `tool.end` success (not expecting a reply)

`agent_spawn (wait=true)`:
- Task: "Spawn the worker agent to [do something], wait for its result, then report it"
- Verify: `agent_spawn` in events with `wait: true`, result is non-empty, task has `parent_task_id` set
- AI Judge: Did the orchestrator's final response actually incorporate the worker's output?

`agent_spawn (wait=false)`:
- Task: "Spawn 2 worker agents in parallel to [do something], collect their taskIds"
- Verify: Two `agent_spawn` events with `wait: false`, two separate taskIds returned

`task_subscribe`:
- Task: Create a task, have the agent subscribe to it
- Verify: `task_subscribe` in events, subscription exists in the DB (check via API behavior after the task completes)

---

**TEST 4.7 — Custom tool loading**
- **Why**: Custom tools in `.veil/agents/<name>/tools/` must be auto-discovered and made available.
- **How**:
  1. Install the `echo-tool.js` fixture into the test agent's tools folder
  2. `GET /agents/:name/skills` → verify the custom tool appears in the list
  3. Give the agent a task that calls the echo tool
  4. Verify `tool.start` + `tool.end` in events for the custom tool name
- **Dev agent**: Look at how custom tools are loaded in `core/agent loader` and what the `schema + execute` export shape must be

---

### GROUP 5: Multi-Agent Flows
*Reasoning: Multi-agent communication is where the most subtle bugs hide. HTTP 200 is meaningless here — the tool can "succeed" but pass empty, truncated, or malformed data between agents. This is the primary group requiring an AI judge.*

---

**TEST 5.1 — Orchestrator spawns worker (sync, wait=true)**
- **Why**: The full sync delegation pattern. If `agent_spawn(wait=true)` passes an empty instruction or loses the result, the orchestrator gets nothing, but no error is raised.
- **How**:
  1. Give the orchestrator a task: "Spawn the worker agent and ask it to [specific task], then report exactly what it said"
  2. Poll the orchestrator task to completion
  3. Check events: `agent_spawn` was called with `wait: true` and a non-empty `instruction`
  4. Check the worker task was created with a `parent_task_id` matching the orchestrator task
  5. Check the orchestrator's output references the worker's result
- **Verification**: Level 3 + AI Judge
- **AI Judge**: "Given this orchestrator output and this worker output, did the orchestrator correctly incorporate the worker's response?"

---

**TEST 5.2 — Parallel fan-out (wait=false)**
- **Why**: Parallel spawning is a complex pattern. Each `agent_spawn(wait=false)` must return a distinct `taskId` immediately, and both child tasks must actually run.
- **How**:
  1. Give the orchestrator: "Spawn 2 worker agents in parallel with different tasks, collect their results"
  2. Verify: Two distinct `agent_spawn` calls in events, each with `wait: false`
  3. Verify: Two child tasks exist in the DB with `parent_task_id` set
  4. Verify: Both child tasks eventually reach `finished`
  5. Verify: Orchestrator output references both results
- **Verification**: Level 3 + AI Judge

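The invariants in steps 2-4 can be checked mechanically once the child task records are fetched. A sketch, assuming tasks expose `id`, `parent_task_id`, and `status` fields (names to be confirmed against the SQLite schema):

```javascript
// Check fan-out invariants on fetched child task records:
// distinct taskIds, parent link set, and every child finished.
function checkFanOut(children, parentTaskId) {
  const ids = new Set(children.map((t) => t.id));
  if (ids.size !== children.length) throw new Error('duplicate child taskIds');
  for (const t of children) {
    if (t.parent_task_id !== parentTaskId) throw new Error(`child ${t.id} missing parent link`);
    if (t.status !== 'finished') throw new Error(`child ${t.id} ended as ${t.status}`);
  }
  return true;
}
```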
---

**TEST 5.3 — Agent messaging (agent_message sync)**
- **Why**: `agent_message` is synchronous — the caller blocks. If the target agent's response is lost or the call hangs, the calling agent stalls indefinitely.
- **How**:
  1. Give the orchestrator: "Message the worker agent and ask it what 2+2 is, report the answer"
  2. Verify: `agent_message` in events, `tool.end` has a non-empty result
  3. Verify: Orchestrator output contains the answer
- **AI Judge**: Was the answer from the worker agent passed through correctly?

---

**TEST 5.4 — Durable task subscription**
- **Why**: `task_subscribe` writes to SQLite and survives a server restart. If this breaks, agents lose notification of subtask completion.
- **How**:
  1. Create a long-running task (use the sleep tool)
  2. Have the subscriber agent call `task_subscribe` on it
  3. Verify the subscription exists (via runtime behavior — when the task completes, the subscriber gets notified)
  4. Wait for the target task to complete
  5. Verify the subscriber agent received the notification (check its session events)
- **Dev agent**: Check the `task_subscriptions` table schema and how notifications are injected into subscriber sessions

---

**TEST 5.5 — maxSubAgentDepth enforcement**
- **Why**: Without depth limits, a rogue agent can spawn infinitely deep chains.
- **How**:
  1. Set `maxSubAgentDepth: 2` in the test settings
  2. Create an orchestrator that spawns a worker that spawns another worker
  3. Verify: The third-level spawn is rejected with an appropriate error
- **Dev agent**: Check what error/event is emitted when the depth limit is exceeded in `core/`

---

### GROUP 6: Debuggability
*Reasoning: The project owner specifically called this out. VeilCLI claims to provide full observability — token counts, cost, tool traces, conversation history. If these are missing or wrong, developers can't debug agent behavior in production. These tests verify that the instrumentation works, not the agent behavior.*

---

**TEST 6.1 — Token tracking per message**
- **Why**: Token counts must exist on every message for cost accountability.
- **How**:
  1. Run a chat conversation (2-3 turns)
  2. `GET /sessions/:id/messages` → every assistant message has `output_tokens > 0`, every user message has `input_tokens > 0`
  3. Verify no message has `null` or `0` tokens (unless it's a system message — dev agent: check whether system messages are in the message list)
- **Verification**: Level 2

---

**TEST 6.2 — Token tracking per session**
- **Why**: Session-level totals must roll up correctly from the message level.
- **How**:
  1. After the conversation above, `GET /sessions/:id`
  2. Verify `token_input` and `token_output` are non-zero
  3. Verify they are >= the sum of individual message tokens (the system prompt adds overhead)
- **Verification**: Level 2

---

**TEST 6.3 — Cost calculation**
- **Why**: Cost must be calculated even when the LLM provider doesn't return it natively (fall back to a pricing model).
- **How**:
  1. After a conversation, `GET /sessions/:id` → verify `cost` is a non-zero number
  2. `GET /sessions/:id/messages` → verify individual message `cost` fields exist
  3. The session cost should approximately equal the sum of message costs
- **Verification**: Level 2

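The rollup relations in TESTS 6.2 and 6.3 reduce to two inequalities plus one tolerance check. A sketch, assuming the field names used above (`token_input`/`token_output` on sessions, `input_tokens`/`output_tokens`/`cost` on messages) and an arbitrary 10% tolerance for "approximately equal":

```javascript
// Session totals must be >= message sums (system prompt adds overhead),
// and session cost should sit close to the message-cost sum.
function checkRollup(session, messages, costTolerance = 0.1) {
  const inSum = messages.reduce((n, m) => n + (m.input_tokens || 0), 0);
  const outSum = messages.reduce((n, m) => n + (m.output_tokens || 0), 0);
  const costSum = messages.reduce((n, m) => n + (m.cost || 0), 0);
  if (session.token_input < inSum) throw new Error('session input tokens below message sum');
  if (session.token_output < outSum) throw new Error('session output tokens below message sum');
  if (Math.abs(session.cost - costSum) > costSum * costTolerance + 1e-9) {
    throw new Error('session cost deviates from message-cost sum');
  }
  return true;
}
```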
---

**TEST 6.4 — Task event trace completeness**
- **Why**: Task events must capture the full agentic loop for debugging. If events are missing, you can't trace what happened.
- **How**:
  1. Run a task that uses at least 2 different tools
  2. `GET /tasks/:id/events` → verify:
     - At least one `status.change` event (pending → processing)
     - One `tool.start` + `tool.end` pair per tool used
     - A final `status.change` to `finished`
  3. Verify `tool.start` events contain the tool name and parameters
  4. Verify `tool.end` events contain a result (not empty)
- **Verification**: Level 3

---

**TEST 6.5 — Context snapshot**
- **Why**: `GET /tasks/:id/context` must return a valid LLM context snapshot for debugging mid-task.
- **How**:
  1. Run a task
  2. `GET /tasks/:id/context` → verify the `messages`, `tools`, and `iteration` fields exist
  3. `messages` array is non-empty and contains at least a system + user turn
  4. `tools` is a non-empty array (tool schemas)
- **Verification**: Level 2

---

**TEST 6.6 — Session message history**
- **Why**: `GET /sessions/:id/messages` is the primary debugging tool for chat. Pagination must work.
- **How**:
  1. Create a chat session with 10+ turns
  2. `GET /sessions/:id/messages?limit=3&offset=0` → verify exactly 3 messages returned
  3. `GET /sessions/:id/messages?limit=3&offset=3` → verify the next 3
  4. Verify the `role` field is either `user`, `assistant`, or `system` on each
- **Verification**: Level 2
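
The window checks in steps 2-4 are pure functions once both pages are fetched. A sketch, assuming messages carry `id` and `role` fields (names unverified against the schema):

```javascript
// Verify two adjacent pagination windows: exact size, no overlap, valid roles.
function checkPages(page1, page2, limit = 3) {
  if (page1.length !== limit || page2.length !== limit) throw new Error('wrong page size');
  const seen = new Set(page1.map((m) => m.id));
  if (page2.some((m) => seen.has(m.id))) throw new Error('pages overlap');
  const roles = new Set(['user', 'assistant', 'system']);
  for (const m of [...page1, ...page2]) {
    if (!roles.has(m.role)) throw new Error(`invalid role: ${m.role}`);
  }
  return true;
}
```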

---

### GROUP 7: CLI Tool
*Reasoning: VeilCLI has a CLI entry point (start, stop, status, agents, login). The CLI is a different code path from the REST API. If the CLI is broken, new users can't even start the server. These tests invoke the actual CLI binary via the shell.*

---

**TEST 7.1 — Server start/stop via CLI**
- **Why**: `veil start` and `veil stop` are the primary user entry points.
- **How**:
  1. Stop the test server (temporarily)
  2. Run `veil start` in the test workspace → wait for it to be ready (`GET /health` responds)
  3. Run `veil status` → verify the output contains running server info
  4. Run `veil stop` → the server goes down (`GET /health` fails)
- **Dev agent**: Check the `cli/` folder for exact command names and expected output format
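
Step 2's "wait for it to be ready" needs a generic readiness poll, and the same helper serves the daemon tests. A sketch; the `veil` command names and `/health` route are taken from this plan and must still be confirmed per the dev-agent note:

```javascript
// Poll an async predicate until it returns true or the timeout elapses.
async function waitFor(predicate, timeoutMs = 30000, intervalMs = 500) {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    if (await predicate()) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}

// Hypothetical usage for TEST 7.1 (commands/flags to be confirmed in `cli/`):
//   execSync('veil start', { cwd: testWorkspace });
//   const up = await waitFor(() => fetch(`${baseUrl}/health`).then((r) => r.ok, () => false));
//   if (!up) throw new Error('server never became healthy');
```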
---

**TEST 7.2 — Agent listing via CLI**
- **How**:
  1. With the server running, run `veil agents` (or equivalent)
  2. Verify the output contains the test agents created in the workspace
- **Dev agent**: Check the CLI command name and output format in `cli/`

---

**TEST 7.3 — CLI chat flow**
- **Why**: A user should be able to start a chat from the CLI, get a response, and resume it.
- **How**:
  1. Run `veil chat <agent-name>` with piped input (or the appropriate flag for non-interactive mode)
  2. Verify the response is printed to stdout
  3. Note the session ID from the output
  4. Re-run with the session flag to resume — verify the agent acknowledges context
- **Dev agent**: Check the `cli/` start command and whether there's a chat subcommand, or how the CLI initiates chat. Look for non-interactive mode flags for testability.

---

### GROUP 8: Daemon Mode
*Reasoning: Daemon mode runs on cron schedules. Tests can't wait for actual cron ticks — use `POST /agents/:name/daemon/trigger` to fire immediately.*

---

**TEST 8.1 — Daemon start/stop/trigger**
- **How**:
  1. Create a daemon agent with a schedule
  2. `POST /agents/:name/daemon/start` → verify it appears in `GET /daemons`
  3. `POST /agents/:name/daemon/trigger` → verify it fires (a new task is created)
  4. Poll that task to completion
  5. `POST /agents/:name/daemon/stop` → verify it is removed from `GET /daemons`
- **Verification**: Level 3

---

**TEST 8.2 — Daemon conflict policy (skip)**
- **How**:
  1. Start the daemon, trigger it
  2. While the first tick is still running, trigger again
  3. With `conflictPolicy: "skip"` → the second tick should be skipped (no second task created)
- **Dev agent**: Check how the conflict policy is implemented in `infrastructure/cron scheduler`

---

**TEST 8.3 — Daemon reads heartbeat file**
- **Why**: Daemon agents read from `.veil/heartbeats/<name>.md` each tick for instructions. If this is broken, daemon behavior can't be configured at runtime.
- **How**:
  1. Write a specific instruction to the heartbeat file
  2. Trigger the daemon
  3. AI Judge: Did the daemon's task output reflect the heartbeat instruction?
- **Verification**: Level 3 + AI Judge

---

### GROUP 9: Memory Persistence
*Reasoning: Memory is a markdown file on disk, injected into system prompts at session start. The key thing to test: does the agent's memory actually persist across sessions and influence behavior?*

---

**TEST 9.1 — Memory write + API read**
- **How**:
  1. The agent writes to memory via the `memory_write` tool in a task
  2. `GET /agents/:name/memory/MEMORY.md` → verify the written content appears
  3. `PUT /agents/:name/memory/:file` via API → write directly
  4. `GET /agents/:name/memory/:file` → verify the roundtrip
- **Verification**: Level 2-3

---

**TEST 9.2 — Memory influences the next session**
- **Why**: Memory is injected at session start. If injection is broken, agents don't have memory even if the file exists.
- **How**:
  1. Pre-seed agent memory with `"MEMORY_SEED_MARKER_ABC"`
  2. Start a new chat session
  3. Ask the agent: "What do you remember about yourself?"
  4. Verify the agent mentions the marker
- **AI Judge**: Did the agent's response demonstrate it had access to the memory content? (This is the canonical AI judge use case — a script can't verify semantic incorporation)
- **Verification**: Level 5 (AI Judge)

---

**TEST 9.3 — Global vs agent memory scoping**
- **How**:
  1. Write to global memory (`PUT /memory/MEMORY.md`)
  2. Write to agent-specific memory (`PUT /agents/:name/memory/MEMORY.md`)
  3. Start a chat with an agent that has memory enabled
  4. Verify both memories are injected (ask the agent about both markers)
- **AI Judge**: Did the agent demonstrate awareness of both global and agent-specific memory?

---

## 🔧 Implementation Notes for Dev Agent

### How to detect tool usage in events
Every agentic tool call produces a `tool.start` and a `tool.end` event in `GET /tasks/:id/events`. The `tool.start` event contains the tool name and parameters; the `tool.end` event contains the result. **This is the primary verification mechanism for tool tests** — don't try to infer tool usage from the agent's text response.
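
A sketch of that mechanism as a reusable helper. The event field names (`type`, `tool`, `parameters`, `result`) are assumptions inferred from the test descriptions in this plan; confirm them against the events route before relying on this.

```javascript
// Pair up tool.start / tool.end events for one tool, in order of occurrence.
function findToolCalls(events, toolName) {
  const starts = events.filter((e) => e.type === 'tool.start' && e.tool === toolName);
  const ends = events.filter((e) => e.type === 'tool.end' && e.tool === toolName);
  return starts.map((s, i) => ({ start: s, end: ends[i] ?? null }));
}

// Typical assertion: at least one call whose result carries the marker.
function assertToolResult(events, toolName, marker) {
  const calls = findToolCalls(events, toolName);
  if (calls.length === 0) throw new Error(`${toolName} was never called`);
  if (!calls.some((c) => c.end && String(c.end.result).includes(marker))) {
    throw new Error(`no ${toolName} result contained ${marker}`);
  }
}
```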
### Polling pattern for async tasks
```javascript
// `client` is the test suite's HTTP helper for the VeilCLI API.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function pollTask(taskId, timeoutMs = 60000) {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    const task = await client.get(`/tasks/${taskId}`);
    if (['finished', 'failed', 'canceled'].includes(task.status)) return task;
    await sleep(2000);
  }
  throw new Error(`Task ${taskId} timed out`);
}
```

### AI Judge invocation pattern
```javascript
// Only use when a script cannot verify semantically.
// `callModel` and `logArtifact` are suite helpers to be written:
// `callModel` hits an external model API directly (not VeilCLI /completions).
async function aiJudge(context, criteria) {
  const prompt =
    `Given this:\n${context}\n\nDid the following hold: ${criteria}?\n` +
    `Reply TEST_PASS or TEST_FAIL followed by one sentence of reasoning.`;
  const reply = await callModel(prompt);
  // Always log the full prompt + response to failure artifacts.
  logArtifact({ type: 'ai-judge', prompt, reply });
  return reply.trim().startsWith('TEST_PASS');
}
```


### On failure: preserve everything
When any assertion fails:
- Keep the entire test workspace folder (all agent files, DB, memory)
- Dump all HTTP requests/responses made during that test
- Dump AI judge prompts and responses
- Print the workspace path to the console so a developer can inspect it directly

### Test agent design
- `basic-chat`: chat mode only, no tools, memory disabled — for pure API tests
- `task-runner`: task mode, tools: `[read_file, write_file, bash, list_dir, glob, grep, sleep, log_write, todo_write, todo_read, tool_search]`
- `memory-agent`: memory enabled, tools: `[memory_read, memory_write, memory_search]`
- `orchestrator`: tools: `[agent_spawn, agent_message, agent_send, task_create, task_status, task_respond, task_subscribe]`, allowedAgents: `[worker]`
- `worker`: subagent mode, tools: `[read_file, write_file, bash, memory_read, memory_write]`
- `restricted-agent`: all tools denied except `read_file`
- `daemon-agent`: daemon mode with a schedule, reads heartbeat

### Dev agent: things to check in the codebase before implementing
1. `api/routes/` — exact request/response shapes for every endpoint (don't guess field names)
2. `core/agentic loop` — how iterations work, how tools are called, how events are emitted
3. `infrastructure/` — SQLite schema (field names on tasks, sessions, messages, events)
4. `cli/` — exact command names and whether a non-interactive mode exists for chat
5. `schemas/agent.json` — exact valid fields for agent configs
6. `tools/` — each tool's schema, to understand what parameters and return values look like
7. `system-prompts/` — what's injected automatically, so you don't duplicate it in test agents

---

## ✅ Summary of All Decisions Made in Research Session

| Decision | Value |
|---|---|
| Engine type | Test suite + validation runtime |
| Location | Inside the VeilCLI repo, own package |
| Entry point | CLI (`node test-engine.js`, group filters) |
| Server lifecycle | Auto-start/stop per run |
| Workspace | Fresh per run, preserved on failure |
| Parallelism | Parallel across groups, serial within a group |
| AI Judge model | External model, NOT VeilCLI `/completions` |
| AI Judge usage | Only when scripts produce false positives |
| AI Judge trigger | Manually decided per test |
| Fixture strategy | Reusable base set + test-specific additions |
| Test authorship | Hardcoded in engine code, add by coding |
| Verification depth | Per-test, explicitly defined |
| LLM calls | Real (no mocking), tolerate variability |
| Reproducibility | 3 identical runs should all pass |
| Output | Console: enough detail to understand a failure without opening files |
| Failure artifacts | Workspace + HTTP logs + AI judge transcripts |
| CLI tests | Via shell exec of the actual `veil` binary |