incantx 0.1.0

package/README.md ADDED
@@ -0,0 +1,254 @@
+ # incantx
+
+ Test agent conversation flows (including tool calls) using declarative YAML fixtures.
+
+ This repo is intentionally **early-stage**: the runner/CLI works for “next message” assertions (including tool calls), while multi-step tool-execution loops and richer reporting are still in progress.
+
+ ## Getting started
+
+ ```bash
+ bun install
+ bun src/cli.ts tests/fixtures
+ ```
+
+ ## Installation
+
+ incantx’s CLI is a Bun script (`#!/usr/bin/env bun`), so you need Bun installed even if you install incantx via npm.
+
+ Global install (choose one):
+
+ ```bash
+ # Bun
+ bun add -g incantx
+
+ # npm
+ npm i -g incantx
+ ```
+
+ Then run:
+
+ ```bash
+ incantx path/to/fixtures --judge off
+ ```
+
+ ### Use as a linked CLI (Bun)
+
+ From this repo:
+
+ ```bash
+ bun link
+ ```
+
+ From another repo:
+
+ ```bash
+ bun link incantx
+ incantx path/to/fixtures --judge off
+ ```
+
+ ## What this is
+
+ - A test framework for “agents” that behave like a **single Chat Completions call**: input is `{messages, tools}` and output is the **next assistant message** (which may include `tool_calls`).
+ - A fixture format that supports:
+   - Starting tests **mid-conversation** via full message history.
+   - Asserting on the **next assistant message**, including **tool call expectations**.
+   - (Planned) executing tools and continuing until the agent returns a final message.
+ - LLM-based assertions (“judge”) for fuzzy/semantic checks (implemented via OpenAI in `auto` mode; skipped if no `OPENAI_API_KEY`).
+
+ ## Repo layout
+
+ - `src/agent/`: agent types + example agents
+   - `src/agent/types.ts`: OpenAI-style message/tool-call types
+   - `src/agent/exampleJsonlAgent.ts`: a language-agnostic “agent process” example (JSONL over stdin/stdout)
+ - `src/fixture/`: fixture schema types
+   - `src/fixture/types.ts`: YAML fixture file types (agent command + expectations)
+ - `tests/fixtures/`: example YAML fixtures
+   - `tests/fixtures/weather.yaml`: fixture file that targets `exampleJsonlAgent.ts`
+
+ ## CLI
+
+ Run fixtures from a file or a directory:
+
+ ```bash
+ incantx tests/fixtures
+ ```
+
+ LLM judge modes:
+
+ - `--judge auto` (default): use OpenAI judge if `OPENAI_API_KEY` is set; otherwise skip LLM assertions
+ - `--judge off`: never call an LLM judge
+ - `--judge on`: require `OPENAI_API_KEY` (fail if missing)
+
+ Example (no judge calls):
+
+ ```bash
+ incantx tests/fixtures --judge off
+ ```
+
+ ## YAML fixtures
+
+ Fixture files are YAML and contain:
+
+ - a file-level `agent` config (optional), and
+ - a `fixtures[]` list, each with:
+   - optional `history[]` (OpenAI Chat Completions message format)
+   - `input` (the new user message to append)
+   - optional `expect` (assertions on the next assistant message)
+
+ ### Example fixture file
+
+ See `tests/fixtures/weather.yaml` for a complete working example. The key idea is that `history` is a list of verbatim OpenAI-style messages (including `assistant.tool_calls` and `role: tool` messages), so you can paste real traces.
+
+ ```yaml
+ agent:
+   type: subprocess
+   command: ["bun", "src/agent/exampleJsonlAgent.ts"]
+
+ fixtures:
+   - id: weather-requests-tool
+     history:
+       - role: system
+         content: You are a helpful assistant.
+     input: What's the weather in Dublin?
+     expect:
+       tool_calls_match: contains
+       tool_calls:
+         - name: get_weather
+           arguments:
+             location: Dublin
+             unit: c
+ ```
+
+ ## Expectations
+
+ All expectations apply to the **next assistant message** produced after the runner:
+
+ 1. takes `history` (if any),
+ 2. appends `{ role: "user", content: input }`,
+ 3. calls the agent once.
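+
+ As a minimal TypeScript sketch (the helper name is hypothetical; the published runner does the equivalent inside `runFixtureFile`, with `callAgent` being its JSONL subprocess adapter):
+
+ ```ts
+ type Message = { role: string; content?: string; tool_calls?: unknown[] };
+ type AgentRequest = { messages: Message[]; tools: unknown[]; tool_choice: string };
+
+ // Assemble the conversation and ask the agent for exactly one next message.
+ async function nextAssistantMessage(
+   fixture: { history?: Message[]; input: string },
+   callAgent: (req: AgentRequest) => Promise<Message>,
+ ): Promise<Message> {
+   const messages = [...(fixture.history ?? []), { role: "user", content: fixture.input }];
+   return callAgent({ messages, tools: [], tool_choice: "auto" });
+ }
+ ```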
129
+
130
+ ### Expecting tool calls
131
+
132
+ Use `expect.tool_calls` to assert that the next assistant message includes tool calls.
133
+
134
+ - `expect.tool_calls_match: contains` (default): each expected tool call must appear somewhere in the returned `tool_calls`; extra tool calls are allowed; order is ignored.
135
+ - `expect.tool_calls_match: exact`: the returned `tool_calls` must match exactly (same length/order and matching entries).
136
+
137
+ Tool call matching details:
138
+
139
+ - `name` matches `tool_calls[].function.name`.
140
+ - `arguments` is a **subset match** against the parsed JSON from `tool_calls[].function.arguments` (which is a JSON string in OpenAI format).
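+
+ For intuition, the subset match behaves like the recursive check below (a sketch of the bundled `deepPartialMatch`: keys in `expected` must exist and match in `actual`, extra keys in `actual` are ignored, and arrays must match element-for-element at equal length):
+
+ ```ts
+ function deepPartialMatch(expected: unknown, actual: unknown): boolean {
+   if (expected === actual) return true;
+   if (expected === null || actual === null) return false; // both-null case handled above
+   if (typeof expected !== typeof actual) return false;
+   if (Array.isArray(expected)) {
+     // Arrays are exact in length; elements are matched recursively.
+     if (!Array.isArray(actual) || expected.length !== actual.length) return false;
+     return expected.every((item, i) => deepPartialMatch(item, actual[i]));
+   }
+   if (typeof expected === "object") {
+     if (Array.isArray(actual)) return false;
+     const act = actual as Record<string, unknown>;
+     // Objects are subset-matched: only the expected keys are checked.
+     return Object.entries(expected as Record<string, unknown>).every(
+       ([key, value]) => key in act && deepPartialMatch(value, act[key]),
+     );
+   }
+   return false;
+ }
+ ```
+
+ So expecting `{ location: "Dublin" }` passes against arguments `{ "location": "Dublin", "unit": "c" }`.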
141
+
142
+ ### Expecting assistant content (LLM-judged)
143
+
144
+ Use `expect.assistant.llm` to express the intended outcome in natural language.
145
+
146
+ The CLI can grade this using an LLM judge. By default (`--judge auto`), it will only run if `OPENAI_API_KEY` is set; otherwise these checks are marked `SKIP`.
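+
+ Mechanically, the judge is a single `chat/completions` call at temperature 0 that must return a strict JSON verdict. A condensed sketch of the bundled `createOpenAIJudge` (error handling omitted):
+
+ ```ts
+ type JudgeVerdict = { pass: boolean; reason: string };
+
+ async function judgeOnce(expectation: string, assistantMessage: unknown): Promise<JudgeVerdict> {
+   const res = await fetch("https://api.openai.com/v1/chat/completions", {
+     method: "POST",
+     headers: {
+       Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
+       "Content-Type": "application/json",
+     },
+     body: JSON.stringify({
+       model: process.env.OPENAI_JUDGE_MODEL ?? "gpt-4o-mini",
+       temperature: 0,
+       messages: [
+         { role: "system", content: 'You are a strict test evaluator. Respond with ONLY valid JSON: {"pass": boolean, "reason": string}.' },
+         { role: "user", content: JSON.stringify({ expectation, assistant_message: assistantMessage }) },
+       ],
+     }),
+   });
+   const data = await res.json();
+   // A non-JSON reply is treated as a failed check by the real runner.
+   return JSON.parse(data.choices[0].message.content);
+ }
+ ```
+
+ The default model is `gpt-4o-mini`; override it with `--judge-model` or `OPENAI_JUDGE_MODEL`, and point `OPENAI_BASE_URL` at any OpenAI-compatible endpoint.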
147
+
148
+ ## Agent integration (language-agnostic)
149
+
150
+ ### Subprocess (recommended for local agents)
151
+
152
+ For local, language-agnostic agents, the runner will spawn a subprocess and communicate via **JSON Lines** (one JSON object per line) over stdin/stdout.
153
+
154
+ - Runner writes one request JSON object per line to the agent’s stdin.
155
+ - Agent writes one response JSON object per line to stdout.
156
+ - One message per call (no streaming).
157
+
158
+ **Request (minimal)**
159
+
160
+ ```json
161
+ {
162
+ "messages": [{ "role": "user", "content": "hi" }],
163
+ "tools": [],
164
+ "tool_choice": "auto",
165
+ "model": "optional-model-id"
166
+ }
167
+ ```
168
+
169
+ #### How history is passed
170
+
171
+ History is passed by sending the full conversation so far in `messages` **on every call** (Chat Completions style). This is what makes “start tests mid-conversation” possible: the runner simply begins `messages` with whatever prior turns you want.
172
+
173
+ When tools are involved, history typically includes:
174
+
175
+ 1. an assistant message containing `tool_calls`, then
176
+ 2. one or more `role: "tool"` messages containing tool results (each with `tool_call_id`), then
177
+ 3. the next user message, etc.
178
+
179
+ Example (abridged):
180
+
181
+ ```json
182
+ {
183
+ "messages": [
184
+ { "role": "system", "content": "You are a helpful assistant." },
185
+ { "role": "user", "content": "What's the weather in Dublin?" },
186
+ {
187
+ "role": "assistant",
188
+ "content": "",
189
+ "tool_calls": [
190
+ {
191
+ "id": "call_1",
192
+ "type": "function",
193
+ "function": { "name": "get_weather", "arguments": "{\"location\":\"Dublin\"}" }
194
+ }
195
+ ]
196
+ },
197
+ {
198
+ "role": "tool",
199
+ "tool_call_id": "call_1",
200
+ "name": "get_weather",
201
+ "content": "{\"temp_c\":10,\"condition\":\"Rain\"}"
202
+ },
203
+ { "role": "user", "content": "Should I bring an umbrella?" }
204
+ ]
205
+ }
206
+ ```
207
+
208
+ **Response (minimal)**
209
+
210
+ ```json
211
+ {
212
+ "message": { "role": "assistant", "content": "hello", "tool_calls": [] }
213
+ }
214
+ ```
215
+
216
+ If the agent process can’t handle the request, it should return:
217
+
218
+ ```json
219
+ { "error": { "message": "..." } }
220
+ ```
221
+
222
+ ### HTTP (optional; useful later for remote agents)
223
+
+ When you want to test remote agents, a small OpenAI-compatible subset (e.g. `POST /chat/completions`) would make a good secondary adapter. The payload shape should match the `messages`/`tools` schema above.
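+
+ A hypothetical HTTP adapter could be as small as the sketch below (nothing like this ships in 0.1.0; the endpoint and response shape are assumed to be OpenAI-compatible):
+
+ ```ts
+ type Message = { role: string; content?: string; tool_calls?: unknown[] };
+
+ // Hypothetical: POST the same { messages, tools, tool_choice } payload to a remote agent.
+ async function callHttpAgent(
+   baseUrl: string,
+   request: { messages: Message[]; tools: unknown[]; tool_choice: string },
+ ): Promise<Message> {
+   const res = await fetch(`${baseUrl}/chat/completions`, {
+     method: "POST",
+     headers: { "Content-Type": "application/json" },
+     body: JSON.stringify(request),
+   });
+   if (!res.ok) throw new Error(`Agent HTTP error: ${res.status}`);
+   // An OpenAI-compatible server returns choices[0].message as the next assistant message.
+   const data = await res.json();
+   return data.choices[0].message;
+ }
+ ```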
+
+ ## Example agent process
+
+ `src/agent/exampleJsonlAgent.ts` is a reference implementation of the subprocess protocol. It wraps a tiny mock agent (`createMockWeatherAgent`) that demonstrates:
+
+ - returning `tool_calls` when asked about “weather”
+ - using a prior `role: tool` message to answer follow-ups like “umbrella”
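+
+ For reference, a minimal agent that speaks this protocol fits in a few lines of TypeScript (a toy echo agent, shown as a sketch; it is not the shipped example):
+
+ ```ts
+ // Toy JSONL agent: read one request line from stdin, write one response line to stdout.
+ const firstLine = await new Promise<string>((resolve) => {
+   let buf = "";
+   process.stdin.setEncoding("utf8");
+   process.stdin.on("data", (chunk: string) => {
+     buf += chunk;
+     const idx = buf.indexOf("\n");
+     if (idx !== -1) resolve(buf.slice(0, idx));
+   });
+   process.stdin.on("end", () => resolve(buf));
+ });
+
+ try {
+   const request = JSON.parse(firstLine);
+   const lastUser = [...request.messages].reverse().find((m: any) => m.role === "user");
+   console.log(JSON.stringify({
+     message: { role: "assistant", content: `You said: ${lastUser?.content ?? ""}`, tool_calls: [] },
+   }));
+ } catch (err) {
+   // Protocol-level failure: reply with an error object instead of a message.
+   console.log(JSON.stringify({ error: { message: String(err) } }));
+ }
+ ```
+
+ Save it as e.g. `myAgent.ts` and point a fixture’s `agent.command` at `["bun", "myAgent.ts"]`.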
+
+ ## Roadmap
+
+ - Tool execution loop:
+   - if the assistant returns tool calls, run tools, append `role: tool` messages, and continue until no tool calls (see the sketch after this list)
+ - LLM judge + deterministic grading:
+   - stable prompts, scoring, and report output
+ - CLI + GitHub Action wrapper:
+   - run fixture directories, emit JSON/markdown summaries
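+
+ To make the first roadmap item concrete, the planned loop could look roughly like this (hypothetical sketch; `callAgent` and `runTool` stand in for pieces that do not exist yet):
+
+ ```ts
+ type ToolCall = { id: string; type: "function"; function: { name: string; arguments: string } };
+ type Message = {
+   role: string;
+   content?: string;
+   tool_calls?: ToolCall[];
+   tool_call_id?: string;
+   name?: string;
+ };
+
+ // Hypothetical loop: keep calling the agent until it stops requesting tools.
+ async function runToFinalMessage(
+   messages: Message[],
+   callAgent: (messages: Message[]) => Promise<Message>,
+   runTool: (name: string, args: unknown) => Promise<string>,
+ ): Promise<Message> {
+   while (true) {
+     const reply = await callAgent(messages);
+     messages.push(reply);
+     if (!reply.tool_calls || reply.tool_calls.length === 0) return reply;
+     for (const call of reply.tool_calls) {
+       // Execute each requested tool and append its result as a `role: tool` message.
+       const result = await runTool(call.function.name, JSON.parse(call.function.arguments));
+       messages.push({ role: "tool", tool_call_id: call.id, name: call.function.name, content: result });
+     }
+   }
+ }
+ ```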
+
+ ## Publishing
+
+ Preview the npm tarball contents:
+
+ ```bash
+ bun run publish:dry
+ ```
+
+ This repo uses `prepack`/`prepublishOnly` scripts to build `dist/` and run checks before publishing.
+
+ ## License
+
+ Currently `UNLICENSED` (set in `package.json`). Change this before publishing if you intend open-source use.
package/dist/cli.js ADDED
@@ -0,0 +1,460 @@
+ #!/usr/bin/env bun
+ // @bun
+
+ // src/cli.ts
+ import { stat } from "fs/promises";
+ import { extname, resolve as resolve2 } from "path";
+ var {Glob } = globalThis.Bun;
+
+ // src/runner/runFixtureFile.ts
+ import { readFile } from "fs/promises";
+ import { resolve } from "path";
+
+ // src/fixture/load.ts
+ import { parse as parseYaml } from "yaml";
+ function expandEnvVars(value, env) {
+   return value.replace(/\$\{([A-Z0-9_]+)\}/gi, (_, key) => env[key] ?? "");
+ }
+ function normalizeAgentSpec(raw, env) {
+   const normalized = { ...raw };
+   if (normalized.type === undefined)
+     normalized.type = "subprocess";
+   if (!Array.isArray(normalized.command) || normalized.command.length === 0) {
+     throw new Error("`agent.command` must be a non-empty array of strings.");
+   }
+   normalized.command = normalized.command.map((part) => String(part));
+   if (normalized.cwd !== undefined)
+     normalized.cwd = String(normalized.cwd);
+   if (normalized.timeout_ms !== undefined)
+     normalized.timeout_ms = Number(normalized.timeout_ms);
+   if (normalized.env) {
+     const expanded = {};
+     for (const [k, v] of Object.entries(normalized.env)) {
+       expanded[k] = expandEnvVars(String(v), env);
+     }
+     normalized.env = expanded;
+   }
+   return normalized;
+ }
+ function loadFixtureFile(yamlText, env = process.env) {
+   const data = parseYaml(yamlText);
+   if (!data || typeof data !== "object")
+     throw new Error("Fixture file must be a YAML object.");
+   const file = data;
+   if (!Array.isArray(file.fixtures))
+     throw new Error("Fixture file must contain `fixtures: [...]`.");
+   const normalized = {
+     fixtures: file.fixtures
+   };
+   if (file.agent)
+     normalized.agent = normalizeAgentSpec(file.agent, env);
+   normalized.fixtures = normalized.fixtures.map((fixture, index) => {
+     if (!fixture || typeof fixture !== "object")
+       throw new Error(`fixtures[${index}] must be an object.`);
+     if (!("id" in fixture))
+       throw new Error(`fixtures[${index}].id is required.`);
+     if (!("input" in fixture))
+       throw new Error(`fixtures[${index}].input is required.`);
+     const out = { ...fixture };
+     out.id = String(out.id);
+     out.input = String(out.input);
+     if (out.agent)
+       out.agent = normalizeAgentSpec(out.agent, env);
+     if (out.expect?.tool_calls_match === undefined && out.expect?.tool_calls)
+       out.expect.tool_calls_match = "contains";
+     return out;
+   });
+   return normalized;
+ }
+
+ // src/judge/openaiJudge.ts
+ function createOpenAIJudge(options) {
+   const baseUrl = options.baseUrl ?? process.env.OPENAI_BASE_URL ?? "https://api.openai.com/v1";
+   const model = options.model ?? process.env.OPENAI_JUDGE_MODEL ?? "gpt-4o-mini";
+   return async ({ expectation, message }) => {
+     const system = "You are a strict test evaluator. Decide if the assistant message satisfies the expectation. " + 'Respond with ONLY valid JSON: {"pass": boolean, "reason": string}.';
+     const user = JSON.stringify({
+       expectation,
+       assistant_message: message
+     }, null, 2);
+     const res = await fetch(`${baseUrl}/chat/completions`, {
+       method: "POST",
+       headers: {
+         Authorization: `Bearer ${options.apiKey}`,
+         "Content-Type": "application/json"
+       },
+       body: JSON.stringify({
+         model,
+         temperature: 0,
+         messages: [
+           { role: "system", content: system },
+           { role: "user", content: user }
+         ]
+       })
+     });
+     if (!res.ok) {
+       const body = await res.text().catch(() => "");
+       return {
+         status: "fail",
+         reason: `Judge call failed: ${res.status} ${res.statusText}${body ? `
+ ${body}` : ""}`
+       };
+     }
+     const data = await res.json();
+     const content = data?.choices?.[0]?.message?.content;
+     if (typeof content !== "string" || content.trim().length === 0) {
+       return { status: "fail", reason: "Judge returned no content." };
+     }
+     let parsed;
+     try {
+       parsed = JSON.parse(content);
+     } catch {
+       return { status: "fail", reason: `Judge did not return valid JSON.
+ ${content}` };
+     }
+     if (parsed.pass)
+       return { status: "pass" };
+     return { status: "fail", reason: parsed.reason || "Expectation not satisfied." };
+   };
+ }
+
+ // src/runner/subprocessAgent.ts
+ function isSuccess(value) {
+   return !!value && typeof value === "object" && "message" in value;
+ }
+ function isError(value) {
+   return !!value && typeof value === "object" && "error" in value;
+ }
+ async function readFirstNonEmptyLine(stream, timeoutMs) {
+   const reader = stream.getReader();
+   const decoder = new TextDecoder;
+   let buffer = "";
+   const deadline = Date.now() + timeoutMs;
+   while (true) {
+     const timeLeft = deadline - Date.now();
+     if (timeLeft <= 0)
+       throw new Error(`Timed out waiting for agent response after ${timeoutMs}ms.`);
+     let timerId;
+     const timerPromise = new Promise((_, reject) => {
+       timerId = setTimeout(() => reject(new Error(`Timed out waiting for agent response after ${timeoutMs}ms.`)), timeLeft);
+     });
+     let chunk;
+     try {
+       chunk = await Promise.race([reader.read(), timerPromise]);
+     } finally {
+       if (timerId !== undefined)
+         clearTimeout(timerId);
+     }
+     const { value, done } = chunk;
+     if (done)
+       break;
+     buffer += decoder.decode(value, { stream: true });
+     while (true) {
+       const idx = buffer.indexOf(`
+ `);
+       if (idx === -1)
+         break;
+       const line = buffer.slice(0, idx).trim();
+       buffer = buffer.slice(idx + 1);
+       if (line.length > 0)
+         return line;
+     }
+   }
+   const tail = buffer.trim();
+   if (tail.length > 0)
+     return tail;
+   throw new Error("Agent produced no output on stdout.");
+ }
+ async function callSubprocessAgent(spec, request) {
+   const timeoutMs = spec.timeout_ms ?? 20000;
+   const proc = Bun.spawn(spec.command, {
+     cwd: spec.cwd,
+     env: { ...process.env, ...spec.env ?? {} },
+     stdin: "pipe",
+     stdout: "pipe",
+     stderr: "pipe"
+   });
+   const stdin = proc.stdin;
+   if (!stdin)
+     throw new Error("Failed to open agent stdin.");
+   stdin.write(`${JSON.stringify(request)}
+ `);
+   stdin.end();
+   let stderrText = "";
+   const stderrPromise = (async () => {
+     if (!proc.stderr)
+       return;
+     stderrText = await new Response(proc.stderr).text().catch(() => "");
+   })();
+   try {
+     if (!proc.stdout)
+       throw new Error("Failed to open agent stdout.");
+     const line = await readFirstNonEmptyLine(proc.stdout, timeoutMs);
+     let payload;
+     try {
+       payload = JSON.parse(line);
+     } catch {
+       throw new Error(`Agent stdout is not valid JSON.
+ ` + `Line: ${line}
+ ` + (stderrText ? `Stderr:
+ ${stderrText}` : ""));
+     }
+     if (isError(payload))
+       throw new Error(payload.error.message);
+     if (!isSuccess(payload)) {
+       throw new Error(`Agent response must be { "message": { ... } } or { "error": { "message": ... } }.`);
+     }
+     return payload.message;
+   } finally {
+     proc.kill();
+     await Promise.allSettled([proc.exited, stderrPromise]);
+   }
+ }
+
+ // src/runner/deepMatch.ts
+ function deepPartialMatch(expected, actual) {
+   if (expected === actual)
+     return true;
+   if (expected === null || actual === null)
+     return expected === actual;
+   const expectedType = typeof expected;
+   const actualType = typeof actual;
+   if (expectedType !== actualType)
+     return false;
+   if (Array.isArray(expected)) {
+     if (!Array.isArray(actual))
+       return false;
+     if (expected.length !== actual.length)
+       return false;
+     for (let i = 0;i < expected.length; i++) {
+       if (!deepPartialMatch(expected[i], actual[i]))
+         return false;
+     }
+     return true;
+   }
+   if (expectedType === "object") {
+     if (Array.isArray(actual))
+       return false;
+     const expectedObj = expected;
+     const actualObj = actual;
+     for (const [key, expectedValue] of Object.entries(expectedObj)) {
+       if (!(key in actualObj))
+         return false;
+       if (!deepPartialMatch(expectedValue, actualObj[key]))
+         return false;
+     }
+     return true;
+   }
+   return false;
+ }
+
+ // src/runner/expectations.ts
+ function parseJsonOrUndefined(value) {
+   try {
+     return JSON.parse(value);
+   } catch {
+     return;
+   }
+ }
+ function matchToolCall(expected, actual) {
+   if (actual.function.name !== expected.name)
+     return false;
+   if (expected.arguments === undefined)
+     return true;
+   const actualArgs = parseJsonOrUndefined(actual.function.arguments);
+   if (actualArgs === undefined)
+     return false;
+   return deepPartialMatch(expected.arguments, actualArgs);
+ }
+ function checkToolCalls(expect, message) {
+   const expectedCalls = expect.tool_calls ?? [];
+   if (expectedCalls.length === 0)
+     return { status: "pass" };
+   const actualCalls = message.tool_calls ?? [];
+   const mode = expect.tool_calls_match ?? "contains";
+   if (mode === "contains") {
+     for (const expected of expectedCalls) {
+       const ok = actualCalls.some((actual) => matchToolCall(expected, actual));
+       if (!ok) {
+         return {
+           status: "fail",
+           reason: `Expected tool call not found: ${expected.name}`
+         };
+       }
+     }
+     return { status: "pass" };
+   }
+   if (actualCalls.length !== expectedCalls.length) {
+     return {
+       status: "fail",
+       reason: `Expected exactly ${expectedCalls.length} tool call(s), got ${actualCalls.length}.`
+     };
+   }
+   for (let i = 0;i < expectedCalls.length; i++) {
+     const expected = expectedCalls[i];
+     const actual = actualCalls[i];
+     if (!expected || !actual || !matchToolCall(expected, actual)) {
+       return { status: "fail", reason: `Tool call mismatch at index ${i}.` };
+     }
+   }
+   return { status: "pass" };
+ }
+ async function evaluateExpectations(expect, message, judge) {
+   if (!expect)
+     return { status: "pass" };
+   const toolRes = checkToolCalls(expect, message);
+   if (toolRes.status !== "pass")
+     return toolRes;
+   if (expect.assistant?.llm) {
+     if (!judge)
+       return { status: "skip", reason: "LLM judge not configured." };
+     return await judge({ expectation: expect.assistant.llm, message });
+   }
+   return { status: "pass" };
+ }
+
+ // src/runner/runFixtureFile.ts
+ function pickAgentSpec(file, fixture) {
+   const agent = fixture.agent ?? file.agent;
+   if (!agent)
+     throw new Error(`Fixture '${fixture.id}' has no agent. Add file-level 'agent:' or fixture-level 'agent:'.`);
+   if ((agent.type ?? "subprocess") !== "subprocess")
+     throw new Error(`Unsupported agent type: ${agent.type}`);
+   return agent;
+ }
+ function makeJudge(mode, model) {
+   if (mode === "off")
+     return;
+   const apiKey = process.env.OPENAI_API_KEY;
+   if (!apiKey) {
+     if (mode === "on")
+       throw new Error("Judge mode is 'on' but OPENAI_API_KEY is not set.");
+     return;
+   }
+   return createOpenAIJudge({ apiKey, model });
+ }
+ async function runFixtureFile(path, options = {}) {
+   const absolute = resolve(path);
+   const yamlText = await readFile(absolute, "utf8");
+   const file = loadFixtureFile(yamlText);
+   const judgeMode = options.judgeMode ?? "auto";
+   const judge = makeJudge(judgeMode, options.judgeModel);
+   const results = [];
+   for (const fixture of file.fixtures) {
+     try {
+       const agent = pickAgentSpec(file, fixture);
+       const messages = [...fixture.history ?? [], { role: "user", content: fixture.input }];
+       const message = await callSubprocessAgent(agent, {
+         messages,
+         tools: [],
+         tool_choice: "auto"
+       });
+       const expectation = await evaluateExpectations(fixture.expect, message, judge);
+       results.push({
+         id: fixture.id,
+         status: expectation.status,
+         reason: expectation.reason,
+         message
+       });
+     } catch (err) {
+       const reason = err instanceof Error ? err.message : String(err);
+       results.push({ id: fixture.id, status: "fail", reason });
+     }
+   }
+   return { path: absolute, results };
+ }
+
+ // src/cli.ts
+ function usage() {
+   return [
+     "Usage:",
+     " incantx <file-or-dir> [--judge auto|off|on] [--judge-model <model>]",
+     " incantx run <file-or-dir> [--judge auto|off|on] [--judge-model <model>]",
+     "",
+     "Examples:",
+     " incantx tests/fixtures/weather.yaml",
+     " incantx tests/fixtures --judge off"
+   ].join(`
+ `);
+ }
+ function parseArgs(argv) {
+   const [first, ...restAll] = argv;
+   const opts = {};
+   if (!first)
+     return { command: "help" };
+   if (first === "-h" || first === "--help")
+     return { command: "help" };
+   const target = first === "run" ? restAll[0] : first;
+   const rest = first === "run" ? restAll.slice(1) : restAll;
+   for (let i = 0;i < rest.length; i++) {
+     const a = rest[i];
+     if (a === "--judge")
+       opts.judge = rest[++i];
+     else if (a === "--judge-model")
+       opts.judgeModel = rest[++i];
+     else if (a === "-h" || a === "--help")
+       return { command: "help" };
+     else
+       throw new Error(`Unknown arg: ${a}`);
+   }
+   return { command: "run", target, opts };
+ }
+ async function listFixtureFiles(target) {
+   const abs = resolve2(target);
+   const s = await stat(abs);
+   if (s.isFile())
+     return [abs];
+   const glob = new Glob("**/*.{yaml,yml}");
+   const out = [];
+   for await (const rel of glob.scan({ cwd: abs, onlyFiles: true })) {
+     out.push(resolve2(abs, rel));
+   }
+   out.sort();
+   return out;
+ }
+ function formatStatus(status) {
+   if (status === "pass")
+     return "PASS";
+   if (status === "skip")
+     return "SKIP";
+   return "FAIL";
+ }
+ async function main() {
+   const parsed = parseArgs(process.argv.slice(2));
+   if (parsed.command === "help" || !("command" in parsed) || parsed.command !== "run") {
+     console.log(usage());
+     process.exit(0);
+   }
+   if (!parsed.target)
+     throw new Error("Missing <file-or-dir>.");
+   const files = await listFixtureFiles(parsed.target);
+   if (files.length === 0)
+     throw new Error(`No fixture files found under: ${parsed.target}`);
+   let pass = 0;
+   let fail = 0;
+   let skip = 0;
+   for (const file of files) {
+     if (![".yaml", ".yml"].includes(extname(file)))
+       continue;
+     const res = await runFixtureFile(file, {
+       judgeMode: parsed.opts.judge ?? "auto",
+       judgeModel: parsed.opts.judgeModel
+     });
+     console.log(res.path);
+     for (const r of res.results) {
+       console.log(` ${formatStatus(r.status)} ${r.id}${r.reason ? ` \u2014 ${r.reason}` : ""}`);
+       if (r.status === "pass")
+         pass++;
+       else if (r.status === "skip")
+         skip++;
+       else
+         fail++;
+     }
+   }
+   console.log(`
+ Summary: ${pass} passed, ${fail} failed, ${skip} skipped`);
+   process.exit(fail > 0 ? 1 : 0);
+ }
+ await main();
+
+ //# debugId=E9C824803FAB034D64756E2164756E21