npm - @k71n/agent-probe - Versions diffs - 0.1.0 - Mend

@k71n/agent-probe 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

package/LICENSE +21 -0
package/README.md +324 -0
package/dist/assets/playbook.md +113 -0
package/dist/assets/skill/SKILL.md +32 -0
package/dist/cleanup/cleanup-verify.js +233 -0
package/dist/constants.js +55 -0
package/dist/evidence/diff.js +0 -0
package/dist/evidence/evidence-store.js +238 -0
package/dist/evidence/query.js +97 -0
package/dist/index.js +16 -0
package/dist/ingest/contract.js +94 -0
package/dist/ingest/ingest.js +211 -0
package/dist/logger.js +18 -0
package/dist/node-version.js +18 -0
package/dist/server.js +90 -0
package/dist/session/run-boundaries.js +44 -0
package/dist/session/session-manager.js +354 -0
package/dist/session/state-dir.js +26 -0
package/dist/tools.js +242 -0
package/package.json +38 -0

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Catalinm
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md ADDED

@@ -0,0 +1,324 @@
+# agent-probe
+**Evidence-based, leave-no-trace debugging for AI coding agents — an MCP server that lets your agent place temporary probes in running code, capture runtime evidence across services, name the root cause, and verifiably clean up after itself.**
+agent-probe is the local-dev counterpart to production AI debugging. Your agent (Claude Code, Cursor, or any MCP host) drives the whole loop: instrument → reproduce → analyze → remove → verify. Evidence lives in a per-session local SQLite file that is **destroyed when the session ends**. No SDK in your app, no accounts, no dashboards, and no network egress, ever.
+---
+## Why agent-probe?
+The hardest bugs are invisible from the code alone: a write succeeds, the dependent read shows nothing, and no error appears anywhere. Static analysis and grepping can't see runtime state — and `console.log` debugging by an agent leaves litter in your tree and noise in your terminal.
+- **Probes are one-liners, not a library** — a self-contained, fire-and-forget HTTP POST to localhost, wrapped in marker comments. Your app gains zero dependencies.
+- **Probing never perturbs the app** — the server replies 202 *before* validating anything, applies no backpressure, and a dead server changes nothing (proven by an explicit behavior-equivalence test).
+- **Evidence is structured, not scrollback** — timelines across services, bounded queries, and a first-class *diff between a failing run and a working run* that surfaces the discriminating difference directly.
+- **Cleanup is verified, never assumed** — the server scans your workspace itself and *refuses to close the session* while any probe marker remains. `git diff` ends empty.
+- **Everything stays on your machine** — localhost-only ingestion (bound at the kernel level), ephemeral per-session storage, zero telemetry.
+- **Language-agnostic by contract** — anything that can POST JSON conforms: JS/TS, Python, shell/curl, Go, SQL comments for markers, and more.
+---
+## Features
+### The debugging loop
+- **Goal-scoped sessions** — `start_session(goal, workspace_root)` opens an isolated investigation with its own evidence store and ingestion port
+- **Runs as first-class boundaries** — tag reproductions (`"buggy"`, `"clean"`), and events are attributed to the run they arrived in; out-of-run events are kept as `"unattributed"`, never lost
+- **Cross-service timelines** — time-ordered events from every service in one view, with honest `seq_tied` flags when ordering rests on arrival rather than causality
+- **Bounded span queries** — filter by run, probe, service, or time range; results are capped server-side with `truncated` + true `total`, so a context window holds hypotheses, not haystacks
+- **Run diffing (the wedge)** — structured presence/absence, ordering inversions, and payload deltas between two runs
+### Non-perturbation guarantees
+- **202-before-validate ingestion** — a probe never waits on parsing, validation, or storage
+- **Fire-and-forget idiom** — no `await`, no retries, no queues; errors are swallowed at the probe (`.catch(() => {})`)
+- **Caps instead of pressure** — oversized payloads are truncated (event kept), events past the per-session cap are dropped with a warning; the app never feels any of it
+- **Warnings surface on every tool response** — rejected events, truncations, and drops are reported to the agent; silence about dropped evidence would mislead the analysis
+### Leave no trace
+- **Marker convention** — every probe is wrapped in own-line comment pairs (`// agent-probe-begin p1` … `// agent-probe-end p1`) that survive formatters and are mechanically removable
+- **Server-side cleanup verification** — `verify_cleanup` scans the workspace (git-aware; `node_modules/` and `.git/` always skipped) and returns exact file+line locations
+- **The close gate** — `end_session` is refused with `MARKERS_REMAIN` while markers exist; the override is explicit, user-owned, and always reported in the result
+- **Destructive close** — the session's SQLite file and lock are deleted; orphans from crashes are surfaced and disposed explicitly, never silently resumed
+### Agent guidance, token-cheap
+- **Pull-once playbook** — conventions, the wire contract, probe strategy, and the removal ritual ship as an MCP resource (`playbook://probes`) read once per session
+- **Optional Agent Skill** — skill-capable hosts get progressive-disclosure triggering (~100 idle tokens); it routes to the playbook and is never required
+---
+## Quick start
+### Install
+One MCP config entry. Requires **Node >= 22.13**, because the server uses the built-in `node:sqlite` module; on an older Node it says so on stderr.
+**Claude Code**
+```sh
+claude mcp add agent-probe -- npx -y @k71n/agent-probe@latest
+```
+**Cursor** (`.cursor/mcp.json`)
+```json
+{
+  "mcpServers": {
+    "agent-probe": {
+      "command": "npx",
+      "args": ["-y", "@k71n/agent-probe@latest"]
+    }
+  }
+}
+```
+Note that `npx` caches versions: use `@latest` as above, or pin a version. The npm package is scoped (`@k71n/agent-probe`); the tool, bin, and probe markers are plain `agent-probe`.
+### Try it: the golden demo
+The repo ships a tiny two-layer app staging the classic silent failure — *the form saves, but the list never updates*. No errors anywhere.
+```sh
+node examples/golden-demo/api/server.mjs
+# open http://localhost:4280 — save an entry, watch it never appear
+```
+Then tell your agent:
+> The form saves but the list never updates — find out why.
+What happens next:
+1. The agent starts a session and pulls the playbook once.
+2. It inserts marker-wrapped probe lines on both sides of the data boundary (write path + read path).
+3. You reproduce the bug twice — once failing, once working — while the agent captures both runs.
+4. The timeline names the root cause: the entry was **written with `categoryId: null`** while the list **filtered on a category** — a silent field-name mismatch between frontend and backend.
+5. The agent removes every probe, the server verifies the workspace is clean, and the session is destroyed with all its evidence. `git diff` is empty.
+---
+## Tool reference
+The surface is deliberately frozen at **9 tools**.
+| Tool | Arguments | Description |
+|------|-----------|-------------|
+| `start_session` | `goal`, `workspace_root`, `stale?` | Open a Debug Session scoped to a stated goal. Returns `session_id`, the ingestion `port`, the echoed `workspace_root`, and the playbook URI. Refused with `STALE_SESSION_EXISTS` if orphans exist (resolve with `stale: "dispose"` or `"keep"`), or `INSTANCE_CONFLICT` if another live process holds the workspace. |
+| `start_run` | `tag?` | Arm a Run while the user reproduces the flow. With host elicitation support, waits for the user's confirmation and returns the closed run; otherwise returns `status: "open"` and `end_run` closes it. |
+| `end_run` | `tag?` | Close the open Run (a tag here overwrites one set at start). |
+| `list_runs` | — | Every run with tag, boundaries, and event count. |
+| `get_timeline` | `run`, `limit?` | Time-ordered events across all services for a run (or `"unattributed"`). Same-millisecond events are flagged `seq_tied`. |
+| `get_span` | `run?`, `probe?`, `service?`, `from?`, `to?`, `limit?` | Bounded slice by any filter combination. |
+| `diff_runs` | `a`, `b` | Structured differences between two runs: presence/absence, ordering changes, payload deltas. |
+| `verify_cleanup` | — | Server-side scan of the workspace for residual probe markers; returns exact locations and orphan (unpaired) markers separately. |
+| `end_session` | `override?` | Destroy the session and all stored evidence. Refused with `MARKERS_REMAIN` while markers remain; `override: true` is the user's call and is always reported in the result. |
+### Response envelope
+Every successful response is the same shape — even single-item ones:
+```json
+{ "data": { ... }, "truncated": false, "total": 1, "warnings": ["..."] }
+```
+`truncated`/`total` make result capping honest; `warnings` (when present) carry ingestion rejections, payload truncations, and drops.
+### Errors
+Every tool error is `{ code, message, hint }` with `code` drawn from a closed enum:
+`NO_ACTIVE_SESSION` · `SESSION_ALREADY_ACTIVE` · `STALE_SESSION_EXISTS` · `INSTANCE_CONFLICT` · `MARKERS_REMAIN` · `RUN_NOT_FOUND` · `NO_ACTIVE_RUN` · `INVALID_STATE`
+The `hint` always says what to do next — errors are written for agents.
+---
+## The probe contract
+Probes POST JSON to `http://127.0.0.1:<port>/events` (the port comes from `start_session`). snake_case on the wire; optional fields are **omitted** when absent, never null.
+| Field | Type | Required | Meaning |
+|-------|------|----------|---------|
+| `session_id` | string | yes | from the `start_session` response |
+| `probe_id` | string | yes | matches the probe's marker id |
+| `service` | string | yes | which service emitted it (`"web"`, `"api"`, …) |
+| `file` | string | yes | source file the probe lives in |
+| `line` | int | yes | source line |
+| `ts_probe` | int | yes | epoch **milliseconds** at emission |
+| `payload` | JSON | yes | any JSON value; keep it ≤ 64 KiB (truncated beyond) |
+| `trace_id` | string | no | correlation headroom (W3C-aligned, free-form) |
+| `parent_id` | string | no | correlation headroom (W3C-aligned, free-form) |
+The server assigns `ts_server` and a monotonic `seq` on arrival. Unknown extra keys are ignored — strictness would break language-agnosticism. No `Content-Type` required.
+### The idiom (JS/TS)
+```ts
+// agent-probe-begin p1
+fetch(`http://127.0.0.1:${PORT}/events`, { method: "POST", body: JSON.stringify({ session_id: SID, probe_id: "p1", service: "api", file: "list.ts", line: 42, ts_probe: Date.now(), payload: { categoryId, rowCount } }) }).catch(() => {});
+// agent-probe-end p1
+```
+Markers are **own-line** comment pairs (they survive Prettier/ESLint), both carrying the probe id; the comment leader adapts to the language (`//`, `#`, `--`, `<!-- -->`). Anything that can POST JSON conforms — shell:
+```sh
+curl -s -X POST http://127.0.0.1:$PORT/events -d "{\"session_id\":\"$SID\",\"probe_id\":\"p3\",\"service\":\"worker\",\"file\":\"job.sh\",\"line\":7,\"ts_probe\":$(date +%s%3N),\"payload\":{\"jobId\":\"$JOB\"}}" >/dev/null 2>&1 &
+```
+The full conventions — probe strategy for silent write/read failures, the removal ritual, run protocol — live in the bundled playbook (`playbook://probes`), which agents pull once per session.
+### Limits (enforced, not asserted)
+| Cap | Value | Behavior past it |
+|-----|-------|------------------|
+| Payload size | 64 KiB | truncated, event kept, warning emitted |
+| Events per session | 100,000 | dropped with warning |
+| Events per query result | 500 | clamped; `truncated: true` + true `total` |
+---
+## Architecture
+```
+        MCP host (Claude Code, Cursor, …)
+                      |
+               stdio (JSON-RPC)            your services
+                      |                  (web, api, worker)
+            +---------v----------+               |
+            |     MCP server     |        marker-wrapped
+            |  9 tools + the     |        one-line probes
+            |  playbook resource |               |
+            +---------+----------+      POST /events (fire-and-forget)
+                      |                          |
+        +-------------+--------------+   +-------v---------+
+        |       SessionManager       |   | Ingest listener |
+        |  state machine, lockfiles, |<--+ 127.0.0.1:ephem |
+        |  orphan recovery, close    |   | 202-before-     |
+        |  gate (MARKERS_REMAIN)     |   | validate        |
+        +------+--------------+------+   +-----------------+
+               |              |
+     +---------v----+  +------v----------+
+     | Evidence     |  | Cleanup verify  |
+     | store + query|  | (workspace scan,|
+     | + run diff   |  |  marker pairing)|
+     | (SQLite,     |  +-----------------+
+     |  per-session,|
+     |  destroyed   |
+     |  on close)   |
+     +--------------+
+```
+### How a session works
+1. **`start_session`** validates the workspace, takes a PID lockfile (one active session per workspace), creates a per-session SQLite file, and returns the ingestion port + playbook URI.
+2. **The agent instruments** — it inserts whole probe lines into your code, baking the session id and port in as literals. The server never edits your files.
+3. **Ingestion** replies `202` immediately, then parses, validates against the contract, applies caps, and flushes to SQLite once per event-loop tick in a single transaction. Events arriving while a run is open are attributed to it.
+4. **Analysis** runs entirely over the bounded query tools — one shared ordering comparator (`ts_probe`, then `seq`) backs the timeline, spans, and the run diff.
+5. **Cleanup** is a trust split: the *server* reports exact marker locations (fresh scan every time), the *agent* deletes the marked ranges, the server re-verifies. `end_session` only succeeds clean — then closes and unlinks everything.
+### Crash safety
+There is no state outside the session's SQLite file and its lockfile. Sudden death leaves only an orphan `.db`; the next `start_session` surfaces it (`STALE_SESSION_EXISTS`, with the goal of the lost investigation in the hint) and the user decides: dispose or keep. Orphans are never silently resumed.
+### State location
+Per-user, per-platform (override with `XDG_STATE_HOME`):
+| Platform | Path |
+|----------|------|
+| Linux | `~/.local/state/agent-probe/` |
+| macOS | `~/Library/Application Support/agent-probe/` |
+| Windows | `%LOCALAPPDATA%/agent-probe/` |
+Inside: `sessions/<session-id>.db` (deleted on close) and `locks/` (PID lockfiles, reaped when stale).
+---
+## Development
+### Setup
+```sh
+git clone <this repo>
+cd agent-probe
+npm ci
+```
+### Scripts
+| Command | What it does |
+|---------|--------------|
+| `npm run dev` | Run the server from source (tsx, stdio) |
+| `npm test` | Full vitest suite (unit + integration) |
+| `npm run test:watch` | Watch mode |
+| `npm run typecheck` | `tsc --noEmit` (strict, `noUncheckedIndexedAccess`) |
+| `npm run lint` | ESLint over the codebase |
+| `npm run build` | `tsc` + inject prose assets into `dist/` |
+### Repository layout
+```
+src/
+  index.ts              entry point (Node version gate, then stdio server)
+  server.ts             MCP server wiring + the playbook resource
+  tools.ts              the 9 tool registrations (the ONE place tools exist)
+  constants.ts          single source for names, limits, error codes
+  logger.ts             stderr-only logging (stdout belongs to MCP framing)
+  session/              session state machine, run boundaries, lockfiles, state dirs
+  ingest/               POST /events listener, wire contract, caps, warnings
+  evidence/             per-session SQLite store, timeline/span queries, run diff
+  cleanup/              marker scanning, pairing, removal-range derivation
+  integration/          cross-module flow tests + the golden-demo fixture
+assets/
+  playbook.md           the agent playbook (template; injected at build)
+  skill/SKILL.md        optional Agent Skill (template; injected at build)
+examples/
+  golden-demo/          runnable demo app with the staged silent-failure bug
+build-playbook.mjs      build-time placeholder injection (+ drift guards)
+.github/workflows/      ci.yml (tests, greps, packaging gates), release.yml (npm publish)
+```
+### Design rules the code enforces
+- **stdout is sacred** — it carries MCP framing only; all diagnostics go to stderr. Lint + CI grep enforce it.
+- **No egress** — the only networking is the localhost ingestion listener; net-client imports are lint-banned, URLs are CI-grepped.
+- **Frozen runtime deps** — `@modelcontextprotocol/sdk` + `zod`, nothing else.
+- **Single-sourced names** — the package name and marker token live in `src/constants.ts`; prose templates use placeholders injected at build, and CI fails on survivors.
+- **Tools never touch SQL** — strict layering: tools → SessionManager → EvidenceStore.
+See [CONTRIBUTING.md](CONTRIBUTING.md) for the full list and the PR process.
+### Releases
+Pushing a `v*` tag runs `release.yml`: typecheck → tests → lint → build → pack-install smoke (tarball contents + the installed bin must boot) → `npm publish` via OIDC trusted publishing (no tokens, provenance attached automatically).
+---
+## 3-minute demo
+<!-- DEMO_RECORDING_URL -->
+*Recording coming with launch.*
+---
+## Known limitations
+Honest ones, each with its mitigation:
+1. **Timing-sensitive bugs may shift under instrumentation.** Behavior equivalence is proven for application-visible *outputs* (see `src/integration/behavior-equivalence.test.ts`), not microsecond timing. If your bug is a sub-millisecond race, probes can move it.
+2. **Git hygiene: don't commit probed code mid-session.** Markers are plain greppable comments and `verify_cleanup` is the gate — but nothing stops a `git commit` while probes are in place. Finish the loop before committing.
+3. **Container/WSL2 clock skew can disorder cross-service timelines.** `ts_probe` is each app's own clock. Keep services on one clock domain, or read `seq_tied` flags honestly — arrival order is not causality.
+4. **`workspace_root` is agent-supplied and trusted (v1).** The cleanup scan verifies everything *within* it; it does not verify that it *is* your project root. Glance at the echoed path when the session starts.
+5. **One active session per workspace (v1).** A second session against the same workspace is refused while the first holds the lock; stale sessions from crashes are surfaced and disposed explicitly.
+6. **Non-git workspaces over-scan.** Without git's ignore rules the cleanup scan walks everything under `workspace_root`, so stale markers in build output may surface. They're real markers — delete the build artifacts or rebuild.
+---
+## Contributing
+Contributions are welcome — see [CONTRIBUTING.md](CONTRIBUTING.md) for setup, the project's invariants, and the PR process. All participants are expected to follow the [Code of Conduct](CODE_OF_CONDUCT.md).
+Found a security issue? Please report it privately — see [SECURITY.md](SECURITY.md).
+## License
+[MIT](LICENSE)
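
Reviewer sketch for the README above: it says one shared comparator (`ts_probe`, then `seq`) backs the timeline, that same-millisecond events are flagged `seq_tied`, and that results are clamped to 500 with `truncated` plus a true `total`. The TypeScript below is a minimal illustration of those three statements, not the package's code. The names `ProbeEvent`, `Envelope`, `compareEvents`, and `buildTimeline` are hypothetical, and treating "same millisecond" as equal `ts_probe` values is an assumption.

```typescript
// Hypothetical event shape: only the fields the ordering needs.
interface ProbeEvent {
  probe_id: string;
  service: string;
  ts_probe: number; // epoch milliseconds, set by the probe
  seq: number; // monotonic arrival counter, set by the server
}

// Response envelope as the README describes it (warnings omitted here).
interface Envelope<T> {
  data: T;
  truncated: boolean;
  total: number;
}

// "Events per query result" cap from the README's limits table.
const MAX_EVENTS_PER_RESULT = 500;

// Shared ordering: probe timestamp first, arrival sequence as tiebreaker.
function compareEvents(a: ProbeEvent, b: ProbeEvent): number {
  return a.ts_probe - b.ts_probe || a.seq - b.seq;
}

// Sort, flag events that share a millisecond with a neighbour, then clamp.
// Assumption: a tie means equal ts_probe values.
function buildTimeline(
  events: ProbeEvent[],
  limit: number = MAX_EVENTS_PER_RESULT,
): Envelope<Array<ProbeEvent & { seq_tied: boolean }>> {
  const sorted = [...events].sort(compareEvents);
  const flagged = sorted.map((event, i) => {
    const prev = sorted[i - 1];
    const next = sorted[i + 1];
    const seq_tied =
      (prev !== undefined && prev.ts_probe === event.ts_probe) ||
      (next !== undefined && next.ts_probe === event.ts_probe);
    return { ...event, seq_tied };
  });
  const cap = Math.min(Math.max(limit, 0), MAX_EVENTS_PER_RESULT);
  return {
    data: flagged.slice(0, cap),
    truncated: flagged.length > cap,
    total: flagged.length,
  };
}
```

Reporting `total` alongside `truncated` lets a caller tell how far to narrow a filter instead of guessing whether more events exist.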

package/dist/assets/playbook.md ADDED

@@ -0,0 +1,113 @@
+# Probe Playbook
+You are the debugger. The server stores and queries probe evidence; YOU place
+and remove probes. Read this once per session — everything you need is here.
+Never paste this content into tool calls; act on it.
+## Session flow
+1. `start_session(goal, workspace_root)` → returns `session_id`, `port`, echoed `workspace_root`. Bake `session_id` and `port` into every probe.
+2. Pull this playbook (you just did — don't pull it again).
+3. Instrument: place probes in the code under `workspace_root` (conventions below).
+4. `start_run(tag)` → the user reproduces the bug → the run closes (user confirmation, or call `end_run`).
+5. Analyze: `list_runs`, `get_timeline`, `get_span`, `diff_runs`.
+6. Remove every probe, then `verify_cleanup` until `clean: true` (ritual below).
+7. `end_session` — destroys all stored evidence. Refused with `MARKERS_REMAIN` while markers remain.
+## Probe conventions — markers are the ground truth
+Every probe is wrapped in an own-line comment pair carrying the same probe-id:
+```ts
+// agent-probe-begin p1
+fetch(`http://127.0.0.1:${PORT}/events`, { method: "POST", body: JSON.stringify({ session_id: SID, probe_id: "p1", service: "api", file: "list.ts", line: 42, ts_probe: Date.now(), payload: { categoryId, rowCount } }) }).catch(() => {});
+// agent-probe-end p1
+```
+Rules — each one exists because something breaks without it:
+- Markers are OWN-LINE comments (never trailing a code line) wrapping COMPLETE statements. Own-line is what survives Prettier/ESLint reformatting.
+- `begin` AND `end` both carry the probe-id. Unique probe-id per probe.
+- The comment leader adapts to the language: `// agent-probe-begin p2` (JS/TS/Go), `# agent-probe-begin p2` (Python/shell), `-- agent-probe-begin p2` (SQL), `<!-- agent-probe-begin p2 -->` (HTML).
+- Probes go ONLY inside `workspace_root` — the server's cleanup verifier scans nothing else, so a probe outside it can never be verified removed.
+- Never edit existing lines to insert a probe; INSERT whole new lines. Removal then restores the file byte-for-byte.
+## Event contract v1 (the wire shape)
+`POST http://127.0.0.1:<port>/events` with a JSON body. snake_case on the wire. Optional fields are OMITTED when absent — never null.
+| field      | type   | required | meaning                                          |
+|------------|--------|----------|--------------------------------------------------|
+| session_id | string | yes      | from the start_session response                  |
+| probe_id   | string | yes      | matches the probe's marker id                    |
+| service    | string | yes      | which service emitted it ("web", "api", …)       |
+| file       | string | yes      | source file the probe lives in                   |
+| line       | int    | yes      | source line                                      |
+| ts_probe   | int    | yes      | epoch MILLISECONDS at emission                   |
+| payload    | JSON   | yes      | any JSON value; keep it ≤ 64 KiB (truncated beyond) |
+| trace_id   | string | no       | W3C Trace Context headroom                       |
+| parent_id  | string | no       | W3C Trace Context headroom                       |
+The server assigns `ts_server` and a monotonic `seq` on ingestion. Epoch-ms is a one-liner in most languages: `Date.now()` (JS/TS), `int(time.time() * 1000)` (Python), `$(date +%s%3N)` (shell with GNU `date`; BSD/macOS `date` has no `%N`), `time.Now().UnixMilli()` (Go).
+## Fire-and-forget — never perturb the app
+The probe idiom, verbatim — non-awaited, error-swallowed:
+```ts
+fetch(`http://127.0.0.1:${PORT}/events`, { method: "POST", body: JSON.stringify({ ...event }) }).catch(() => {});
+```
+- NO `await`, NO retries, NO timeouts-with-retry, NO queues.
+- NO shared logging helpers or wrapper functions — a probe is a self-contained one-liner, or it isn't mechanically removable and isn't language-agnostic.
+- Browser code near a page unload/navigation: add `keepalive: true` so the request survives.
+- `PORT` and `SID` come from the `start_session` response — bake the literal values in.
+Other languages, same contract (anything that can POST JSON conforms; `Content-Type` not required):
+```sh
+curl -s -X POST http://127.0.0.1:$PORT/events -d "{\"session_id\":\"$SID\",\"probe_id\":\"p3\",\"service\":\"worker\",\"file\":\"job.sh\",\"line\":7,\"ts_probe\":$(date +%s%3N),\"payload\":{\"jobId\":\"$JOB\"}}" >/dev/null 2>&1 &
+```
+```python
+# agent-probe-begin p4
+import json, time, urllib.request
+try: urllib.request.urlopen(urllib.request.Request("http://127.0.0.1:PORT/events", data=json.dumps({"session_id": SID, "probe_id": "p4", "service": "api", "file": "views.py", "line": 88, "ts_probe": int(time.time() * 1000), "payload": {"query": str(qs.query)}}).encode()), timeout=1)
+except Exception: pass
+# agent-probe-end p4
+```
+## Probe strategy: the silent write-path/read-path failure
+The classic: a write succeeds, the dependent read shows nothing, no error anywhere. Strategy:
+- Probe BOTH sides of the data boundary. Write side: what was persisted, with which discriminating values (ids, foreign keys, flags, tenant/category fields). Read side: what the query filtered on, and what came back (row count, first row).
+- Put the DISCRIMINATING fields in `payload` — the fields that decide matching (the category id written vs the category id queried), not entire entities.
+- Capture two runs: `start_run(tag: "buggy")` for the failing path, `start_run(tag: "clean")` for a working comparison (a case that does show up).
+- `diff_runs(a, b)` then surfaces presence/absence, ordering changes, and payload deltas directly — you rarely need to read either timeline whole.
+## Run protocol
+- `start_run` may wait for the user's confirmation (the host shows a prompt: reproduce now, then confirm) — or it returns `status: "open"` and you call `end_run` when the user says the reproduction is done.
+- Tag runs (`"buggy"`, `"clean"`); a tag at `end_run` overwrites the start tag.
+- Events arriving OUTSIDE any run are kept, not lost: query them with `run: "unattributed"`.
+## Analysis
+- `list_runs` — every run with tag, boundaries, event count.
+- `get_timeline(run)` — time-ordered events across all services. Events flagged `seq_tied` arrived inside the same millisecond: their order rests on arrival, NOT on causality. Never read jitter as causality.
+- `get_span(filters)` — bounded slice by run/probe/service/time range. Results are capped; `truncated: true` + `total` tell you when to narrow the filter.
+- `diff_runs(a, b)` — the structured account of differences between two runs.
+## Cleanup ritual — leave no trace, verified
+1. Run `verify_cleanup` — the server scans `workspace_root` itself and returns the EXACT file+line of every marker. Trust this list, not your memory.
+2. Per file, delete each marked range INCLUSIVE (begin line through end line, whole lines — never blank them), working bottom-up (highest start line first) so earlier line numbers stay valid.
+3. Re-run `verify_cleanup`. Locations from before any edit are stale — always re-scan, then remove. Repeat until `clean: true`.
+4. Orphans (a begin without its end, or vice versa) are reported separately — fix them by hand; never guess a deletion range from an orphan.
+5. `end_session`. If it returns `MARKERS_REMAIN`, markers still exist: go back to step 1. `override: true` exists, but it is the USER's decision: ask first, then tell them the override was used, along with the residual locations from the result. An override is always reported, never silent. Evidence is destroyed either way.
+## Observability
+- Any tool response may carry `warnings`: rejected events, payload truncations, drops past caps, stale sessions. Read them — silence about dropped evidence would mislead your analysis.
+- No active session → posted events are rejected. Events with a stale `session_id` (from a previous session's probes) are rejected too: stale probes can't pollute a new investigation.
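
Reviewer sketch for the playbook's cleanup ritual: pair each begin marker with its end marker, report unpaired markers as orphans, and delete paired ranges inclusively, working bottom-up so earlier line numbers stay valid. The TypeScript below illustrates that procedure on an in-memory array of lines. It is not the package's `cleanup-verify.js`; the `MARKER` pattern and the names `scanMarkers` and `removeRanges` are assumptions made for illustration.

```typescript
// 1-based, inclusive line range of one paired probe.
interface MarkerRange {
  id: string;
  begin: number;
  end: number;
}

// A begin without its end, or an end without its begin.
interface Orphan {
  id: string;
  kind: "begin" | "end";
  line: number;
}

// Assumed marker pattern: the token, "begin" or "end", then the probe id.
const MARKER = /agent-probe-(begin|end)\s+([A-Za-z0-9_-]+)/;

function scanMarkers(lines: string[]): { ranges: MarkerRange[]; orphans: Orphan[] } {
  const open = new Map<string, number>(); // probe id -> line of unmatched begin
  const ranges: MarkerRange[] = [];
  const orphans: Orphan[] = [];
  lines.forEach((text, index) => {
    const match = MARKER.exec(text);
    if (match === null) return;
    const kind = match[1];
    const id = match[2];
    if (kind === undefined || id === undefined) return;
    const line = index + 1;
    if (kind === "begin") {
      // A second begin for the same id orphans the earlier one.
      const earlier = open.get(id);
      if (earlier !== undefined) orphans.push({ id, kind: "begin", line: earlier });
      open.set(id, line);
    } else {
      const begin = open.get(id);
      if (begin === undefined) {
        orphans.push({ id, kind: "end", line });
      } else {
        ranges.push({ id, begin, end: line });
        open.delete(id);
      }
    }
  });
  // Any begin still open at end of file never found its end.
  open.forEach((line, id) => orphans.push({ id, kind: "begin", line }));
  return { ranges, orphans };
}

// Delete each range inclusively, highest start line first.
// Orphans are deliberately left alone: no range is guessed from them.
function removeRanges(lines: string[], ranges: MarkerRange[]): string[] {
  const result = [...lines];
  const bottomUp = [...ranges].sort((a, b) => b.begin - a.begin);
  for (const range of bottomUp) {
    result.splice(range.begin - 1, range.end - range.begin + 1);
  }
  return result;
}
```

Because probes are inserted as whole new lines, deleting whole ranges is what restores the file to its original content; blanking the lines would leave a visible change behind.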

package/dist/assets/skill/SKILL.md ADDED

@@ -0,0 +1,32 @@
+---
+name: agent-probe
+description: Drives evidence-based debugging of multi-service applications through the agent-probe MCP server - places temporary probes, captures runtime evidence across services, compares runs, and verifies clean removal. Use when the user says "debug this", asks why something happens at runtime, or hunts a bug spanning more than one service or process.
+---
+# agent-probe — evidence-based multi-service debugging
+## When to use (and when not to)
+Use this skill when a bug's cause is invisible from the code alone: runtime behavior across services, silent failures, "the write succeeds but the read shows nothing". Do NOT use it for static questions (type errors, syntax, refactoring) or when a single stack trace already explains the failure.
+## The session loop
+1. `start_session(goal, workspace_root)` — note the returned `session_id` and `port`.
+2. Pull the playbook resource `playbook://probes` ONCE — it carries every convention you need.
+3. Instrument: place probes in the code, following the playbook's conventions exactly.
+4. `start_run(tag)` → the user reproduces the bug → the run closes (user confirmation, or `end_run`). Capture a failing run AND a working comparison run when possible.
+5. Analyze: `list_runs`, `get_timeline`, `get_span`, `diff_runs`.
+6. Remove every probe, then `verify_cleanup` until it reports clean.
+7. `end_session` — destroys all stored evidence.
+## Standing disciplines
+- A probe is a self-contained fire-and-forget one-liner — no shared helpers, no awaits, no retries.
+- Every probe is wrapped in an own-line marker comment pair carrying a unique probe id.
+- Probes go ONLY inside `workspace_root` — nothing outside it can ever be verified removed.
+- Cleanup is verified, never assumed — re-scan until the server reports clean.
+- Ask the user before any `override`, and report it — it destroys evidence while markers remain.
+## Authority
+The playbook resource (`playbook://probes`) is the source of truth for conventions, the event contract, and removal mechanics — pull it before placing any probe. If anything here seems to conflict with it, the playbook wins.
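
Reviewer sketch for the run comparison the skill and playbook rely on: `diff_runs` is described as reporting presence/absence, ordering changes, and payload deltas between two runs, but this diff does not show its output shape. The TypeScript below is therefore only a guess at the idea, covering presence/absence and payload deltas and leaving ordering out. The types `RunEvent` and `RunDiff`, the function names, and the choice to compare the first event per probe id are all assumptions.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Hypothetical minimal event: which probe fired, and with what payload.
interface RunEvent {
  probe_id: string;
  payload: Json;
}

// Hypothetical result shape; the real tool's output is not shown in the source.
interface RunDiff {
  onlyInA: string[];
  onlyInB: string[];
  payloadDeltas: Array<{ probe_id: string; a: Json; b: Json }>;
}

// Keep the first payload seen for each probe id (a simplification).
function firstPayloadByProbe(events: RunEvent[]): Map<string, Json> {
  const first = new Map<string, Json>();
  for (const event of events) {
    if (!first.has(event.probe_id)) first.set(event.probe_id, event.payload);
  }
  return first;
}

function diffRuns(a: RunEvent[], b: RunEvent[]): RunDiff {
  const firstA = firstPayloadByProbe(a);
  const firstB = firstPayloadByProbe(b);
  const diff: RunDiff = { onlyInA: [], onlyInB: [], payloadDeltas: [] };
  firstA.forEach((payloadA, probeId) => {
    const payloadB = firstB.get(probeId);
    if (payloadB === undefined) {
      diff.onlyInA.push(probeId); // probe fired in run A only
      return;
    }
    // Serialized comparison: simple, but sensitive to object key order.
    if (JSON.stringify(payloadA) !== JSON.stringify(payloadB)) {
      diff.payloadDeltas.push({ probe_id: probeId, a: payloadA, b: payloadB });
    }
  });
  firstB.forEach((_payload, probeId) => {
    if (!firstA.has(probeId)) diff.onlyInB.push(probeId); // run B only
  });
  return diff;
}
```

Putting only the discriminating fields in each payload, as the playbook advises, keeps a delta list like this short enough to read directly.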