ashr-labs 0.4.0 → 0.4.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (3)
  1. package/README.md +245 -0
  2. package/dist/cli.js +139 -0
  3. package/package.json +1 -1
package/README.md ADDED
@@ -0,0 +1,245 @@
1
+ # Ashr Labs TypeScript SDK
2
+
3
+ A TypeScript client library for evaluating AI agents against Ashr Labs test datasets.
4
+
5
+ ## Documentation
6
+
7
+ - [Testing Your Agent](docs/testing-your-agent.md) — **start here**
8
+ - [Quick Start Guide](docs/quickstart.md)
9
+ - [Installation](docs/installation.md)
10
+ - [Authentication](docs/authentication.md)
11
+ - [API Reference](docs/api-reference.md)
12
+ - [Error Handling](docs/error-handling.md)
13
+ - [Examples](docs/examples.md)
14
+
15
+ ## Installation
16
+
17
+ ```bash
18
+ npm install ashr-labs
19
+ ```
20
+
21
+ ## Quick Start
22
+
23
+ ```typescript
24
+ import { AshrLabsClient, EvalRunner } from "ashr-labs";
25
+
26
+ // Only need your API key — baseUrl and tenantId are automatic
27
+ const client = new AshrLabsClient("tp_your_api_key_here");
28
+
29
+ // Fetch a dataset and run your agent against it
30
+ const runner = await EvalRunner.fromDataset(client, 42);
31
+ const run = await runner.run(myAgent);
32
+
33
+ // Inspect results
34
+ const metrics = run.build().aggregate_metrics as Record<string, unknown>;
35
+ console.log(`Passed: ${metrics.tests_passed}/${metrics.total_tests}`);
36
+ console.log(`Avg similarity: ${metrics.average_similarity_score}`);
37
+
38
+ // Submit results
39
+ await run.deploy(client, 42);
40
+ ```
41
+
42
+ Your agent just needs two methods:
43
+
44
+ ```typescript
45
+ import type { Agent } from "ashr-labs";
46
+
47
+ const myAgent: Agent = {
48
+ async respond(message: string) {
49
+ // Call your LLM, return { text: "...", tool_calls: [...] }
50
+ return { text: "response", tool_calls: [] };
51
+ },
52
+
53
+ async reset() {
54
+ // Clear conversation history between scenarios
55
+ },
56
+ };
57
+ ```
58
+
59
+ See [Testing Your Agent](docs/testing-your-agent.md) for a full end-to-end guide.
60
+
61
+ ## Observability — Production Tracing
62
+
63
+ Trace your agent in production. Captures LLM calls, tool invocations, and events. **Never rejects** — if the backend is unreachable, errors are logged locally instead of being thrown.
64
+
65
+ ```typescript
66
+ // wrap() pattern — auto-end on completion, auto-capture errors
67
+ await client.trace("handle-ticket", { userId: "user_42" }).wrap(async (trace) => {
68
+ const gen = trace.generation("classify", { model: "claude-sonnet-4-6",
69
+ input: [{ role: "user", content: "help" }] });
70
+ const result = await callLlm(...);
71
+ gen.end({ output: result, usage: { input_tokens: 50, output_tokens: 12 } });
72
+
73
+ await trace.span("tool:search", { input: { q: "..." } }).wrap(async (s) => {
74
+ const data = await search(...);
75
+ s.end({ output: data });
76
+ });
77
+ });
78
+
79
+ // Analytics
80
+ const analytics = await client.getObservabilityAnalytics(7);
81
+ console.log(`Traces: ${analytics.overview.total_traces}`);
82
+ console.log(`Tool calls: ${analytics.overview.total_tool_calls}`);
83
+ ```
84
+
85
+ See [API Reference](docs/api-reference.md) for full Trace/Span/Generation docs.
86
+
87
+ ## VM Stream Logs
88
+
89
+ Attach virtual machine session logs to test results for browser-based or desktop-based agents:
90
+
91
+ ```typescript
92
+ const test = run.addTest("checkout_flow");
93
+ test.start();
94
+ // ... run agent, add tool calls and responses ...
95
+
96
+ // Kernel browser session (first-class support)
97
+ test.setKernelVm("kern_sess_abc123", {
98
+ durationMs: 15000,
99
+ logs: [
100
+ { ts: 0, type: "navigation", data: { url: "https://app.example.com" } },
101
+ { ts: 1200, type: "action", data: { action: "click", selector: "#login" } },
102
+ ],
103
+ replayId: "replay_abc123",
104
+ replayViewUrl: "https://www.kernel.sh/replays/replay_abc123",
105
+ stealth: true,
106
+ viewport: { width: 1920, height: 1080 },
107
+ });
108
+
109
+ // Or use the generic setVmStream() for any provider
110
+ test.setVmStream("browserbase", {
111
+ sessionId: "sess_abc123",
112
+ durationMs: 45000,
113
+ logs: [
114
+ { ts: 0, type: "navigation", data: { url: "https://app.example.com" } },
115
+ { ts: 1200, type: "action", data: { action: "click", selector: "#login" } },
116
+ ],
117
+ });
118
+ test.complete();
119
+ ```
120
+
121
+ ## Available Methods
122
+
123
+ All methods that accept `tenantId` auto-resolve it from your API key if omitted.
124
+
125
+ ### Datasets
126
+
127
+ | Method | Description |
128
+ |--------|-------------|
129
+ | `getDataset(datasetId, ...)` | Get a dataset by ID |
130
+ | `listDatasets(tenantId, limit, offset, ...)` | List datasets |
131
+
132
+ ### Runs
133
+
134
+ | Method | Description |
135
+ |--------|-------------|
136
+ | `createRun(datasetId, result, ...)` | Create a new test run |
137
+ | `getRun(runId)` | Get a run by ID |
138
+ | `listRuns(datasetId, tenantId, limit, offset)` | List runs |
139
+ | `deleteRun(runId)` | Delete a run |
140
+
141
+ ### EvalRunner
142
+
143
+ | Method | Description |
144
+ |--------|-------------|
145
+ | `EvalRunner.fromDataset(client, datasetId)` | Create a runner from a dataset |
146
+ | `runner.run(agent, { maxWorkers })` | Run agent against all scenarios, return `RunBuilder` |
147
+ | `runner.runAndDeploy(agent, client, datasetId, { maxWorkers })` | Run and submit in one call |
148
+
149
+ ### RunBuilder
150
+
151
+ | Method | Description |
152
+ |--------|-------------|
153
+ | `new RunBuilder()` | Create a new run builder |
154
+ | `run.start()` | Mark the run as started |
155
+ | `run.addTest(testId)` | Add a test and get a `TestBuilder` |
156
+ | `run.complete(status)` | Mark the run as completed |
157
+ | `run.build()` | Serialize to a result object |
158
+ | `run.deploy(client, datasetId)` | Build and submit via the API |
159
+
160
+ ### TestBuilder
161
+
162
+ | Method | Description |
163
+ |--------|-------------|
164
+ | `test.start()` | Mark the test as started |
165
+ | `test.addUserFile(filePath, description)` | Record a user file upload |
166
+ | `test.addUserText(text, description)` | Record a user text input |
167
+ | `test.addToolCall(expected, actual, matchStatus)` | Record an agent tool call |
168
+ | `test.addAgentResponse(expectedResponse, actualResponse, matchStatus)` | Record an agent response |
169
+ | `test.setVmStream(provider, opts)` | Attach VM session logs |
170
+ | `test.setKernelVm(sessionId, opts)` | Attach Kernel VM session (convenience) |
171
+ | `test.complete(status)` | Mark the test as completed |
172
+
173
+ ### Requests
174
+
175
+ | Method | Description |
176
+ |--------|-------------|
177
+ | `createRequest(requestName, request, ...)` | Create a new request |
178
+ | `getRequest(requestId)` | Get a request by ID |
179
+ | `listRequests(tenantId, status, limit, offset)` | List requests |
180
+
181
+ ### Observability
182
+
183
+ | Method | Description |
184
+ |--------|-------------|
185
+ | `client.trace(name, opts?)` | Start a production trace (returns `Trace`) |
186
+ | `trace.span(name, opts?)` / `trace.generation(name, opts?)` | Add spans or LLM calls |
187
+ | `trace.wrap(fn)` / `span.wrap(fn)` | Auto-end on completion, auto-capture errors |
188
+ | `await trace.end(opts?)` | Flush trace to backend (**never rejects**) |
189
+ | `listObservabilityTraces(opts?)` | List traces |
190
+ | `getObservabilityTrace(traceId)` | Get trace with full observation tree |
191
+ | `getObservabilityAnalytics(days?)` | Analytics: tokens, latency, errors, tool perf |
192
+ | `getObservabilityErrors(opts?)` | Traces with errors |
193
+ | `getObservabilityToolErrors(opts?)` | Traces with tool failures |
194
+
195
+ ### API Keys & Session
196
+
197
+ | Method | Description |
198
+ |--------|-------------|
199
+ | `init()` | Validate credentials and get user/tenant info |
200
+ | `listApiKeys(includeInactive)` | List API keys for your tenant |
201
+ | `revokeApiKey(apiKeyId)` | Revoke an API key |
202
+ | `healthCheck()` | Check if the API is reachable |
203
+
204
+ ## Error Handling
205
+
206
+ ```typescript
207
+ import { AshrLabsClient, NotFoundError, AuthenticationError } from "ashr-labs";
208
+
209
+ const client = new AshrLabsClient("tp_...");
210
+
211
+ try {
212
+ const dataset = await client.getDataset(999);
213
+ } catch (e) {
214
+ if (e instanceof AuthenticationError) {
215
+ console.log("Invalid API key");
216
+ } else if (e instanceof NotFoundError) {
217
+ console.log("Dataset not found");
218
+ }
219
+ }
220
+ ```
221
+
222
+ ## Configuration
223
+
224
+ ```typescript
225
+ // All defaults — just pass API key
226
+ const client = new AshrLabsClient("tp_...");
227
+
228
+ // From environment (reads ASHR_LABS_API_KEY)
229
+ const client = AshrLabsClient.fromEnv();
230
+
231
+ // Custom timeout
232
+ const client = new AshrLabsClient("tp_...", undefined, 60);
233
+
234
+ // Custom base URL (for self-hosted)
235
+ const client = new AshrLabsClient("tp_...", "https://your-api.example.com");
236
+ ```
237
+
238
+ ## Requirements
239
+
240
+ - Node.js 18+
241
+ - TypeScript 5.4+ (recommended)
242
+
243
+ ## License
244
+
245
+ MIT
package/dist/cli.js CHANGED
@@ -148,6 +148,140 @@ Continuously improve "${config.agentName}" by running evaluations and fixing iss
148
148
  - Deploy the final passing run with \`results.deploy(client, datasetId)\`.
149
149
  `;
150
150
  }
151
/**
 * Build the Codex/Cursor "test-agent" SKILL.md contents for this project.
 *
 * Fix: the original computed `isTs`/`sdkPkg` but then emitted Python-only
 * instructions (`_ashr_eval.py`, `os.environ[...]`, snake_case SDK calls)
 * even when `config.lang === "typescript"`, telling TypeScript users to
 * write a Python script importing the npm package. The generated text is
 * now language-aware; the Python-path output is unchanged.
 *
 * @param {{agentName: string, lang: string, entrypoint: string, apiKeyEnvVar: string}} config
 * @returns {string} SKILL.md contents (YAML front matter + markdown body)
 */
function generateCodexTestAgentSkill(config) {
  const isTs = config.lang === "typescript";
  const sdkPkg = isTs ? "ashr-labs" : "ashr_labs";
  // Language-specific surface (script name, env access, SDK method casing)
  // interpolated into the instructions below. TS names follow the README
  // (fromDataset / getRun / maxWorkers); Python names match the original text.
  const api = isTs
    ? {
        script: "_ashr_eval.ts",
        scriptNoun: "TypeScript",
        stdImports: "Needs no other imports — uses built-in timers for polling",
        env: `process.env.${config.apiKeyEnvVar}`,
        listDatasets: "client.listDatasets()",
        generateDataset: "client.generateDataset()",
        fromDataset: "EvalRunner.fromDataset(client, ID)",
        workers: "{ maxWorkers: 1 }",
        seqNote: "(sequential — avoids shared-state issues)",
        deploy: "run.deploy(client, ID)",
        getRun: "client.getRun(runId)",
        notDone: "is not null/undefined",
      }
    : {
        script: "_ashr_eval.py",
        scriptNoun: "Python",
        stdImports: "Imports `time`, `json`, `os`",
        env: `os.environ["${config.apiKeyEnvVar}"]`,
        listDatasets: "client.list_datasets()",
        generateDataset: "client.generate_dataset()",
        fromDataset: "EvalRunner.from_dataset(client, dataset_id=ID)",
        workers: "max_workers=1",
        seqNote: "(sequential — avoids deepcopy issues)",
        deploy: "run.deploy(client, dataset_id=ID)",
        getRun: "client.get_run(run_id)",
        notDone: "is not None",
      };
  return `---
name: test-agent
description: Run the Ashr Labs eval suite against the agent and report results. Use when the user wants to test, evaluate, or benchmark agent behavior.
---

Run an automated evaluation of the "${config.agentName}" agent using the Ashr Labs SDK.
Do this FULLY AUTONOMOUSLY — do not ask the user for input at any step.

## Dataset strategy

**Reuse existing datasets whenever possible.** Only generate a new dataset when:
- The agent's tools have changed (added, removed, or modified)
- The agent's domain or accepted inputs have changed
- You explicitly need fresh/different test scenarios

To find existing datasets, call \`${api.listDatasets}\` and pick the most recent
one for this agent. Store the dataset ID in \`.ashr.json\` under \`eval.datasetId\`
for future runs.

## Steps

1. Read \`.ashr.json\` for project configuration.
2. Read the agent code at \`${config.entrypoint}\` to understand current behavior.
3. Check if \`.ashr.json\` has \`eval.datasetId\` — if so, reuse that dataset.
   If not, call \`${api.listDatasets}\` to find an existing one for this agent.
   Only call \`${api.generateDataset}\` if no suitable dataset exists.
4. Write a SINGLE self-contained eval script (\`${api.script}\`) that does EVERYTHING
   in one process — no separate scripts, no separate polling commands:
   - Imports \`AshrLabsClient\`, \`EvalRunner\` from \`${sdkPkg}\`
   - ${api.stdImports}
   - Creates a client using \`${api.env}\`
   - Uses \`${api.fromDataset}\` to load the dataset
   - Implements a lightweight Agent wrapper around the actual agent code in \`${config.entrypoint}\`
   - Runs \`runner.run(agent)\` with \`${api.workers}\` ${api.seqNote}
   - Deploys results with \`${api.deploy}\`
   - **Polls for grading inline**: loops calling \`${api.getRun}\` every 20 seconds
     until \`aggregate_metrics.tests_passed\` ${api.notDone} (max 5 minutes)
   - Prints final graded results: passed, failed, per-test status, tool mismatches
5. Run the eval script with a long timeout (e.g. 10 minutes).
6. If a new dataset was generated, update \`.ashr.json\` with the new \`eval.datasetId\`.
7. Delete \`${api.script}\`.

## CRITICAL: Single command execution

The eval script MUST be a single file that runs as a single command. Do NOT write
separate scripts for polling or analysis. Do NOT use \`sleep\` in bash. Everything
happens inside the ${api.scriptNoun} script so it only requires ONE permission grant to run.

## Important

- Use the \`${config.apiKeyEnvVar}\` env var for the API key — never hardcode it.
- All grading is done server-side — do not compute scores locally.
- Do NOT ask the user for permission or input. Run everything automatically.
`;
}
209
/**
 * Build the Codex/Cursor "improve-agent" SKILL.md contents for this project.
 *
 * Fix: mirrors the fix in the test-agent generator — the original template
 * hardcoded Python-only instructions (`_ashr_eval.py`, snake_case SDK calls)
 * regardless of `config.lang`, which broke the generated skill for
 * TypeScript projects. The text is now language-aware; Python output is
 * unchanged.
 *
 * @param {{agentName: string, lang: string, entrypoint: string, apiKeyEnvVar: string}} config
 * @returns {string} SKILL.md contents (YAML front matter + markdown body)
 */
function generateCodexImproveAgentSkill(config) {
  const isTs = config.lang === "typescript";
  const sdkPkg = isTs ? "ashr-labs" : "ashr_labs";
  // Language-specific names interpolated into the instructions below.
  // TS names follow the README (fromDataset / getRun / maxWorkers).
  const api = isTs
    ? {
        script: "_ashr_eval.ts",
        scriptNoun: "TypeScript",
        fromDataset: "EvalRunner.fromDataset()",
        workers: "{ maxWorkers: 1 }",
        deploy: "run.deploy(client, ID)",
        getRun: "client.getRun(runId)",
        notDone: "is not null/undefined",
      }
    : {
        script: "_ashr_eval.py",
        scriptNoun: "Python",
        fromDataset: "EvalRunner.from_dataset()",
        workers: "max_workers=1",
        deploy: "run.deploy(client, dataset_id=ID)",
        getRun: "client.get_run(run_id)",
        notDone: "is not None",
      };
  return `---
name: improve-agent
description: Automatically run evals, analyze failures, fix the agent, and re-test until passing. Use when the user wants to improve agent quality or fix failing tests.
---

Continuously improve "${config.agentName}" by running evaluations, analyzing failures, and
applying fixes — ALL AUTOMATICALLY without asking for user input.

## Overview

This is an autonomous improvement loop. You will:
1. Run eval + wait for grading (single script, single command)
2. Analyze failures from the output
3. Fix the agent code
4. Re-run eval (same single script, single command)
5. Repeat until target pass rate is met

Do NOT ask the user for permission between iterations. Just run the loop.

## CRITICAL: Single-script execution

ALL eval + grading + analysis MUST happen in ONE ${api.scriptNoun} script (\`${api.script}\`) run as
ONE command. The script must:

1. Import the agent from \`${config.entrypoint}\` and \`AshrLabsClient\`, \`EvalRunner\` from \`${sdkPkg}\`
2. Load dataset from \`.ashr.json\` \`eval.datasetId\` using \`${api.fromDataset}\`
3. Run \`runner.run(agent)\` with \`${api.workers}\`
4. Deploy with \`${api.deploy}\`
5. Poll \`${api.getRun}\` every 20s until \`aggregate_metrics.tests_passed\` ${api.notDone} (max 5min)
6. Print graded results: pass/fail per test, tool mismatches with expected vs actual args
7. For each FAILED test, also fetch and print the dataset scenario actions so the failure
   context is visible (user messages, expected tool calls at each action index)

## Iteration loop

### After each eval run, read the output and analyze failures:

Common failure patterns:
- **Tool called at wrong step**: Agent calls the right tool but too early/late.
  Fix: adjust system prompt to be more/less eager about that tool.
- **Tool not called**: Agent didn't call an expected tool.
  Fix: strengthen prompt guidance about when to use that tool.
- **Extra unexpected tool call**: Agent called a tool it shouldn't have.
  Fix: add prompt constraints about when NOT to call that tool.
- **Wrong tool arguments**: Agent called the right tool with wrong args.
  Fix: improve tool descriptions or add examples in the system prompt.
- **Tool retry timing**: Agent auto-retries a failed tool instead of returning control
  (or vice versa). Fix: adjust the tool execution loop in the agent code.

### Apply fixes to \`${config.entrypoint}\`:
- **System prompt changes** fix most failures — adjust instructions about tool timing
- **Tool loop logic** changes fix retry/timing issues
- Make the SMALLEST change that addresses each failure. Do not refactor unrelated code.

### Re-run eval:
After editing \`${config.entrypoint}\`, run the SAME \`${api.script}\` script again (it imports
the agent fresh each time). Same single command, same timeout.

### Stop conditions:
- All tests pass or pass rate >= 80%: Stop. Print summary of changes + before/after metrics.
- 5 iterations reached: Stop. Summarize what's still failing.
- No improvement after 2 consecutive iterations: Stop. Explain blockers.

Clean up \`${api.script}\` when done.

## Rules

- Do NOT ask the user for input at any step. Run everything automatically.
- Do NOT write separate scripts for polling or analysis — everything in ONE script.
- Use \`${config.apiKeyEnvVar}\` env var for the API key.
- All grading is server-side. Never grade or score locally.
`;
}
151
285
  function generateHookSettings(config) {
152
286
  return {
153
287
  hooks: {
@@ -387,6 +521,11 @@ async function main() {
387
521
  writeFile(".claude/commands/test-agent.md", generateTestAgentCommand(config));
388
522
  writeFile(".claude/commands/improve-agent.md", generateImproveAgentCommand(config));
389
523
  mergeJsonFile(".claude/settings.json", generateHookSettings(config));
524
+ // Codex + Cursor skills (both scan .agents/skills/; Cursor also scans .cursor/skills/)
525
+ writeFile(".agents/skills/test-agent/SKILL.md", generateCodexTestAgentSkill(config));
526
+ writeFile(".agents/skills/improve-agent/SKILL.md", generateCodexImproveAgentSkill(config));
527
+ writeFile(".cursor/skills/test-agent/SKILL.md", generateCodexTestAgentSkill(config));
528
+ writeFile(".cursor/skills/improve-agent/SKILL.md", generateCodexImproveAgentSkill(config));
390
529
  // Done
391
530
  print(`\n${GREEN} Done.${RESET} Open Claude Code and type ${BOLD}/test-agent${RESET}\n`);
392
531
  }
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "ashr-labs",
3
- "version": "0.4.0",
3
+ "version": "0.4.2",
4
4
  "description": "TypeScript SDK for the Ashr Labs API — agent testing & evaluation",
5
5
  "type": "module",
6
6
  "main": "./dist/index.js",