
@tangle-network/agent-eval 0.17.3 → 0.19.0


Files changed (11)

package/README.md +8 -1
package/dist/index.d.ts +303 -279
package/dist/index.js +332 -210
package/dist/index.js.map +1 -1
package/docs/concepts.md +155 -0
package/docs/control-runtime.md +351 -0
package/docs/feature-guide.md +213 -0
package/docs/feedback-trajectories.md +193 -0
package/docs/multi-shot-optimization.md +122 -0
package/docs/wire-protocol.md +199 -0
package/package.json +21 -14

package/docs/concepts.md ADDED

@@ -0,0 +1,155 @@
+# Concepts
+Read this once and the rest of agent-eval makes sense.
+## What is agent-eval?
+A library for **deciding whether a code generator or content generator did its job.** You give it a thing the generator produced (a scaffold, a patch, a tweet, a JSON config), and you get back a structured verdict: pass/fail, dimension scores, a reason in plain English.
+It exists because LLMs lie about whether they succeeded. A model will say "Done!" and ship code that doesn't compile. agent-eval is the layer between the model's output and your decision to ship.
+## The four things you'll touch most
+| Thing | What it is | One-line example |
+|---|---|---|
+| **Judge** | A function that scores one piece of output. | "Did this scaffold implement async fetching?" |
+| **Rubric** | The recipe a judge uses: what to score on, with what weights. | "Score on buyer_quality (0.5), voice (0.3), signal (0.2)." |
+| **Verifier** | A pipeline of judges run in order, with dependencies. | "install → typecheck → build → semantic" |
+| **Feedback trajectory** | A multi-shot record of attempts, approvals, rejections, edits, metrics, and policy outcomes. | "draft → user rejects → revised draft → approved → measured" |
+That's the whole framework. Everything else (sessions, traces, layers) is plumbing around those four.
+When the thing being evaluated is an agent that should keep working, use
+[`runAgentControlLoop`](./control-runtime.md). It turns validators into a
+runtime loop: observe typed state, validate it, decide the next action, act,
+and repeat until the task passes, blocks, times out, spends too much, or stops
+making progress.
+When normal agent usage should become reusable training/eval signal, use
+[`FeedbackTrajectory`](./feedback-trajectories.md). It captures approvals,
+rejections, edits, option choices, metrics, and policy blocks as portable data
+that can seed memory, replay scenarios, and optimization.
+## Vocabulary, plain English
+| Term | Plain English |
+|---|---|
+| **Artifact** | The thing being judged. Often a workdir of files, sometimes a string of text. |
+| **Snapshot** | A frozen view of an artifact (every file path → content). What the judge actually reads. |
+| **Harness** | A description of *how to run* the artifact: setup command, test command, working dir, timeout. |
+| **Sandbox driver** | The thing that actually executes commands inside the harness. Local subprocess, or remote container. |
+| **Layer** | One stage of a verifier pipeline (install, typecheck, build, semantic, …). |
+| **Finding** | A specific issue a judge found — file, line, severity, message. |
+| **Trace store** | The append-only log of every span/event during a run. Replay = read this back. |
+| **Composite score** | A 0..1 number combining all dimensions. The single number you gate on. |
+| **Rubric version** | A stable hash of the rubric. Scores from different rubric versions are not comparable. |
+| **Muffled gate** | A check that should fail loud but silently passes (e.g. `command || true`). The most expensive bug class in this codebase — see SKILL.md. |
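To make the muffled-gate entry concrete, here is a minimal sketch of the same bug expressed in TypeScript rather than shell. The helper names `muffledGate` and `loudGate` are hypothetical and not part of agent-eval; the only point is that a swallowed error turns a failing check into a silent pass.

```typescript
// Hypothetical helpers for illustration; not part of agent-eval.
type GateCheck = () => void // throws when the check fails

// Muffled: the failure is swallowed, so the gate always reports success.
// This is the in-process equivalent of `command || true`.
function muffledGate(check: GateCheck): boolean {
  try {
    check()
  } catch {
    // error discarded
  }
  return true
}

// Loud: a thrown error becomes a failed gate.
function loudGate(check: GateCheck): boolean {
  try {
    check()
    return true
  } catch {
    return false
  }
}

const failingCheck: GateCheck = () => {
  throw new Error('typecheck failed')
}

const muffled = muffledGate(failingCheck) // reports a pass even though the check failed
const loud = loudGate(failingCheck) // reports the failure
```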
+## The feedback trajectory loop
+For agentic systems, the highest-quality labels often come from normal review
+workflow, not a separate labeling UI:
+```text
+agent proposes -> user approves/rejects/edits/selects -> agent revises -> outcome is measured
+```
+`FeedbackTrajectory` is the portable record of that loop. Browser agents can
+store task outcomes, coding agents can store patch review plus test results,
+and research agents can store reviewer corrections. The domain changes; the
+shape stays the same.
+Those trajectories can be converted into preference memory, `DatasetScenario`
+rows, optimizer rows, and held-out examples for overfit checks.
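As a rough illustration of how a trajectory might become preference data, the sketch below pairs each rejected proposal with the proposal that was eventually approved. The `SketchTrajectory`, `SketchEvent`, and `PreferencePair` shapes are assumptions made up for this example; the library's actual `FeedbackTrajectory` type is documented in feedback-trajectories.md and may look quite different.

```typescript
// Hypothetical shapes for illustration; the real FeedbackTrajectory type may differ.
type SketchEvent =
  | { kind: 'proposal'; attemptId: string; content: string }
  | { kind: 'rejection'; attemptId: string; reason: string }
  | { kind: 'approval'; attemptId: string }

interface SketchTrajectory {
  taskId: string
  events: SketchEvent[]
}

interface PreferencePair {
  taskId: string
  chosen: string
  rejected: string
  reason: string
}

// Pair every rejected proposal with the proposal that was finally approved.
function toPreferencePairs(trajectory: SketchTrajectory): PreferencePair[] {
  const contentById: Record<string, string> = {}
  let approvedId: string | undefined
  for (const event of trajectory.events) {
    if (event.kind === 'proposal') contentById[event.attemptId] = event.content
    if (event.kind === 'approval') approvedId = event.attemptId
  }
  if (approvedId === undefined) return [] // nothing approved, so no preference signal
  const chosen = contentById[approvedId]
  if (chosen === undefined) return [] // approval refers to an unknown proposal
  const pairs: PreferencePair[] = []
  for (const event of trajectory.events) {
    if (event.kind !== 'rejection' || event.attemptId === approvedId) continue
    const rejected = contentById[event.attemptId]
    if (rejected === undefined) continue
    pairs.push({ taskId: trajectory.taskId, chosen, rejected, reason: event.reason })
  }
  return pairs
}

const pairs = toPreferencePairs({
  taskId: 'launch-tweet',
  events: [
    { kind: 'proposal', attemptId: 'a1', content: 'draft one' },
    { kind: 'rejection', attemptId: 'a1', reason: 'too vague' },
    { kind: 'proposal', attemptId: 'a2', content: 'draft two' },
    { kind: 'approval', attemptId: 'a2' },
  ],
})
```

Here `pairs` holds a single entry preferring "draft two" over "draft one", with the reviewer's reason attached.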
+## The three-layer eval (for code generators)
+When the artifact is generated code, agent-eval scores it at three independent layers. Each layer fails differently, and you want to know which one broke:
+```
+L0  builder        Did the agent's session itself work?
+                   (Did it produce an artifact at all?)
+                              │
+                              ▼
+L1  app-build      Does the artifact build / typecheck / test?
+                   (Static signal, ground-truth gate.)
+                              │
+                              ▼
+L2  app-runtime    Does the artifact actually run end-to-end?
+                   (Dynamic signal — only worth checking if L1 passed.)
+```
+`BuilderSession` orchestrates this. It opens at `startChat`, runs the build at `ship`, and runs the runtime check at `runAppScenario`. Each layer emits a trace span, and `scoreProject` aggregates the layers into the composite score.
+Why three? Because each catches a different failure mode:
+- L0 failure: the agent crashed mid-generation and left a half-written file.
+- L1 failure: the files exist but typecheck fails. LLM judges can't reliably catch this.
+- L2 failure: the code compiles but does the wrong thing at runtime.
+If you only check one layer, you ship the bugs that the other two layers would have caught.
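The layering can be sketched as an ordered list of checks that stops at the first failure, so the verdict names which layer broke. This is a simplified stand-in, not `BuilderSession`'s real API: `runLayers`, `LayerCheck`, and `LayeredVerdict` are hypothetical names, and each check is reduced to a synchronous boolean.

```typescript
// Hypothetical sketch; BuilderSession's real API is not reproduced here.
type LayerId = 'L0 builder' | 'L1 app-build' | 'L2 app-runtime'

interface LayerCheck {
  id: LayerId
  run: () => boolean
}

interface LayeredVerdict {
  pass: boolean
  failedAt?: LayerId
  ran: LayerId[]
}

// Run layers in order and stop at the first failure.
// Later layers are skipped because their signal is only meaningful
// once the earlier layers have passed.
function runLayers(checks: LayerCheck[]): LayeredVerdict {
  const ran: LayerId[] = []
  for (const check of checks) {
    ran.push(check.id)
    if (!check.run()) return { pass: false, failedAt: check.id, ran }
  }
  return { pass: true, ran }
}

const verdict = runLayers([
  { id: 'L0 builder', run: () => true },
  { id: 'L1 app-build', run: () => false }, // e.g. typecheck failed
  { id: 'L2 app-runtime', run: () => true },
])
```

In this run `verdict.failedAt` is `'L1 app-build'` and the runtime layer never executes.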
+## How rubrics work
+A rubric describes:
+1. **Dimensions** — the axes you score on (e.g. `buyer_quality`, `voice`, `signal`).
+2. **Weights** — how to combine dimensions into a composite (`0.5 * buyer_quality + 0.3 * voice + 0.2 * signal`).
+3. **Failure modes** — named patterns the judge looks for ("ai-cadence", "vague-claim").
+4. **Wins** — named positive patterns ("specific-component", "earned-detail").
+5. **System prompt** — what to tell the judging LLM about the persona and the task.
+Built-in rubrics ship in `src/wire/rubrics.ts` (e.g. `anti-slop` for technical-buyer voice). You can also pass a rubric inline — the same shape, just defined at the call site.
+A rubric is plain data. The hash of that data is the `rubricVersion`. Two scores are only comparable if they used the same `rubricVersion` — change the rubric and you start a new comparison series.
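The two ideas above (a weighted composite, and a version derived from the rubric data) can be sketched in a few lines. Everything here is an assumption for illustration: `SketchRubric` carries only dimensions and weights, and the version is a small FNV-1a-style hash over a key-sorted serialization, picked because it needs no imports. The library's actual hashing scheme is not described in this document.

```typescript
// Hypothetical rubric shape for illustration; the library's rubric type has more fields.
interface SketchRubric {
  dimensions: string[]
  weights: Record<string, number>
}

// Weighted sum of per-dimension scores. Assumes weights sum to 1 and scores are in 0..1.
function compositeScore(rubric: SketchRubric, scores: Record<string, number>): number {
  let total = 0
  for (const dimension of rubric.dimensions) {
    const weight = rubric.weights[dimension]
    const score = scores[dimension]
    // A dimension with no weight or no score contributes nothing.
    if (weight !== undefined && score !== undefined) total += weight * score
  }
  return total
}

// Serialize with sorted object keys so key order does not affect the version.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return '[' + value.map((item) => canonicalize(item)).join(',') + ']'
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>
    const keys = Object.keys(record).sort()
    return '{' + keys.map((key) => JSON.stringify(key) + ':' + canonicalize(record[key])).join(',') + '}'
  }
  return JSON.stringify(value)
}

// 32-bit FNV-1a-style hash over UTF-16 code units (an assumption, not the library's scheme).
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    // Multiply by the FNV prime 16777619 using shifts, then wrap to 32 bits.
    hash = (hash + ((hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24))) >>> 0
  }
  return ('0000000' + hash.toString(16)).slice(-8)
}

function rubricVersion(rubric: SketchRubric): string {
  return fnv1a32(canonicalize(rubric))
}

const rubric: SketchRubric = {
  dimensions: ['buyer_quality', 'voice', 'signal'],
  weights: { buyer_quality: 0.5, voice: 0.3, signal: 0.2 },
}
// Same data, different key order: should produce the same version.
const reordered: SketchRubric = {
  weights: { signal: 0.2, voice: 0.3, buyer_quality: 0.5 },
  dimensions: ['buyer_quality', 'voice', 'signal'],
}

// 0.5 * 0.8 + 0.3 * 0.6 + 0.2 * 1.0, roughly 0.78
const composite = compositeScore(rubric, { buyer_quality: 0.8, voice: 0.6, signal: 1.0 })
const sameVersion = rubricVersion(rubric) === rubricVersion(reordered)
```

Changing any weight or dimension changes the canonical string, so (barring a hash collision) it also changes the version and starts a new comparison series.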
+## How verifiers work
+When you have a multi-step pipeline (install → typecheck → build → lint → semantic), use `MultiLayerVerifier`:
+```ts
+const verifier = new MultiLayerVerifier([
+  installLayer,      // runs `pnpm install`
+  typecheckLayer,    // runs `tsc --noEmit`, depends on install
+  buildLayer,        // runs `pnpm build`, depends on typecheck
+  semanticLayer,     // LLM judge, weight 3, depends on build
+])
+const report = await verifier.run({ env: { runner, workdir /* , ... */ } })
+report.allPass        // boolean — every layer passed
+report.blendedScore   // 0..1 — weighted aggregate
+report.layers         // per-layer status, findings, duration
+```
+Two rules that will save you bugs (paid for in real incidents — see SKILL.md):
+1. **Run both gates.** Build gates catch code that doesn't compile; structural assertions catch missing files. Run both unconditionally — they catch orthogonal failures.
+2. **Pair LLM judges with build outcomes.** An LLM judge can rate non-compiling code as "looks right" (for example, 0.8). Always short-circuit on `buildOutcome.passed === false` before any LLM judging.
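Rule 2 can be sketched as a guard that returns before the judge is ever consulted. The `BuildOutcome` and `GatedVerdict` shapes and the `scoreWithBuildGate` helper are hypothetical; only the `buildOutcome.passed` field name comes from the rule above. The judge is a synchronous stub here, whereas a real LLM judge would be asynchronous.

```typescript
// Hypothetical shapes; `passed` mirrors the field named in rule 2.
interface BuildOutcome {
  passed: boolean
  log: string
}

interface GatedVerdict {
  pass: boolean
  score: number
  reason: string
}

// Synchronous stub standing in for an LLM judge.
type SketchJudge = (source: string) => number

function scoreWithBuildGate(
  buildOutcome: BuildOutcome,
  source: string,
  judge: SketchJudge,
  threshold = 0.7, // assumed pass threshold for the sketch
): GatedVerdict {
  if (buildOutcome.passed === false) {
    // Never consult the judge for code that does not build.
    return { pass: false, score: 0, reason: 'build failed: ' + buildOutcome.log }
  }
  const score = judge(source)
  return { pass: score >= threshold, score, reason: 'judged after a passing build' }
}

let judgeCalls = 0
const optimisticJudge: SketchJudge = () => {
  judgeCalls += 1
  return 0.8 // "looks right", regardless of whether the code compiles
}

const gated = scoreWithBuildGate(
  { passed: false, log: 'TS2322' },
  'const x: number = "a"',
  optimisticJudge,
)
```

With the guard in place the optimistic judge is never called for the failing build, so its 0.8 cannot leak into the verdict.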
+## The trace model (skip on first read)
+Every operation emits structured spans into a `TraceStore`. A run is a tree:
+```
+builder-session                 [span]
+├── chat-turn                   [span]
+├── ship                        [span]
+│   ├── harness.install         [span]
+│   ├── harness.typecheck       [span]
+│   └── harness.build           [span]
+└── app-runtime                 [span]
+    └── scenario.run            [span]
+```
+Spans are append-only and have stable ids — replay is reading the same store back. OTLP export ships them out for distributed tracing.
+You don't need to build the trace tree by hand. `BuilderSession` does it for you. Look at the trace store when you're debugging a flaky run; ignore it otherwise.
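For intuition only, an append-only store with stable ids can be as small as the sketch below. `InMemoryTraceStore` and `SketchSpan` are hypothetical names invented for this example; the library's `TraceStore` interface is not reproduced here and carries far more than a name and a parent.

```typescript
// Hypothetical in-memory store; the library's TraceStore interface is not reproduced here.
interface SketchSpan {
  id: string
  parentId: string | null
  name: string
}

class InMemoryTraceStore {
  private readonly spans: SketchSpan[] = []
  private nextId = 1

  // Append a span and return its id. Existing spans are never mutated or removed.
  append(name: string, parentId: string | null): string {
    const id = 'span-' + String(this.nextId)
    this.nextId += 1
    this.spans.push({ id, parentId, name })
    return id
  }

  // Replay is reading the stored spans back in insertion order.
  // Copies are returned so callers cannot rewrite history.
  replay(): SketchSpan[] {
    return this.spans.map((span) => ({ ...span }))
  }

  childNames(parentId: string): string[] {
    return this.spans.filter((span) => span.parentId === parentId).map((span) => span.name)
  }
}

// Rebuild the tree shown above.
const store = new InMemoryTraceStore()
const root = store.append('builder-session', null)
store.append('chat-turn', root)
const ship = store.append('ship', root)
store.append('harness.install', ship)
store.append('harness.typecheck', ship)
store.append('harness.build', ship)
const runtime = store.append('app-runtime', root)
store.append('scenario.run', runtime)
```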
+## Where to go next
+- **Need the layman feature map?** → [feature-guide.md](./feature-guide.md) — what each primitive does, when to use it, integration patterns, and guardrails.
+- **Just want to score a string against a rubric?** → [wire-protocol.md](./wire-protocol.md) — HTTP/RPC interface, pluggable from any language.
+- **Need a reusable driver/worker/evaluator loop?** → [control-runtime.md](./control-runtime.md) — generic runtime plus coding, browser, computer-use, and research integration patterns.
+- **Want review feedback to become eval/optimization data?** → [feedback-trajectories.md](./feedback-trajectories.md) — turn feedback into datasets, optimizer rows, and preference memory.
+- **Building a code-generator eval?** → SKILL.md §Minimal working path — the `BuilderSession` recipe.
+- **Multi-layer verifier?** → SKILL.md §Verification pipeline.
+- **Adding a new judge or rubric?** → `src/wire/rubrics.ts` for the cross-language path; `src/anti-slop.ts` and `src/judges.ts` for the in-process path.

package/docs/control-runtime.md ADDED

@@ -0,0 +1,351 @@
+# Agent Control Runtime
+`runAgentControlLoop` is the smallest reusable runtime for agentic tasks:
+```text
+observe state -> validate state -> decide next action -> act -> repeat
+```
+It is intentionally not a topology framework. Direct execution, driver
+intervention, critique/revision, specialist fan-out, and user escalation are
+all just actions selected by policy.
+Use it when an agent should keep working until objective state says the task is
+done, blocked, too expensive, or no longer improving.
+## Core API
+```ts
+import {
+  objectiveEval,
+  runAgentControlLoop,
+  subjectiveEval,
+} from '@tangle-network/agent-eval'
+const result = await runAgentControlLoop({
+  intent: 'Create a final answer with citations and no math errors.',
+  budget: { maxSteps: 6, maxWallMs: 180_000, maxCostUsd: 1.50 },
+  async observe() {
+    return await readCurrentTaskState()
+  },
+  async validate({ state }) {
+    return [
+      objectiveEval({
+        id: 'citations-present',
+        passed: state.citations.length >= 2,
+        severity: 'critical',
+      }),
+      objectiveEval({
+        id: 'math-reconciles',
+        passed: state.mathErrors.length === 0,
+        severity: 'critical',
+      }),
+      subjectiveEval({
+        id: 'answer-usefulness',
+        passed: state.judgeScore >= 0.8,
+        score: state.judgeScore,
+        severity: 'warning',
+      }),
+    ]
+  },
+  async decide({ evals, history }) {
+    const failed = evals.filter((e) => !e.passed)
+    if (!failed.length) return { type: 'stop', pass: true, reason: 'done' }
+    return {
+      type: 'continue',
+      action: { type: 'revise', failures: failed.map((e) => e.id) },
+      reason: `fix ${failed.map((e) => e.id).join(', ')}`,
+    }
+  },
+  async act(action) {
+    return await worker.act(action)
+  },
+  getActionCostUsd: ({ result }) => result.costUsd,
+  stopPolicies: {
+    maxNoProgressSteps: 2,
+    maxRepeatedActions: 3,
+  },
+})
+```
+## Design Rules
+- Keep domain adapters in downstream repos until they are reused by multiple
+  integrations.
+- Use the same adapter in product, benchmark replay, and optimization. Swapping
+  the state reader or worker implementation is fine; changing validators,
+  action semantics, or stop policy means you are no longer measuring what users
+  experience.
+- Prefer objective validators over LLM judges. Reserve LLM judges for
+  subjective calls: usefulness, clarity, and domain-expert review.
+- Treat irreversible external actions as domain policy, not runtime policy.
+  The runtime can stop loops; the downstream adapter must decide which actions
+  require approval before `act()`.
+- Use typed state. Do not make the policy reason only over transcript text.
+- Make `act()` return cost when possible so `maxCostUsd` can enforce recorded
+  spend.
+## Product / Eval Contract
+The runtime is most useful when a downstream product exposes a small adapter:
+```ts
+interface ProductControlAdapter<State, Action, ActionResult> {
+  observe(): Promise<State>
+  validate(state: State): Promise<ControlEvalResult[]>
+  decide(ctx: ControlContext<State, Action, ActionResult>): Promise<ControlDecision<Action>>
+  act(action: Action): Promise<ActionResult>
+}
+```
+In production, the adapter receives real sessions, credentials, and storage. In
+evals, the same adapter receives replay fixtures, sandboxes, or recorded traces.
+The adapter boundary is the transfer point between training and real usage.
+Avoid this split:
+```text
+benchmark harness has one loop
+product runtime has another loop
+optimizer tunes only the benchmark loop
+```
+That creates benchmark wins that do not transfer. Keep one loop and vary only
+the dependencies behind `observe` and `act`.
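A toy version of "one loop, swapped dependencies" might look like the following. It is a deliberately simplified, synchronous stand-in for the adapter contract: `SketchDeps`, `runUntilTarget`, and `replayDeps` are hypothetical names, and the counter state exists only to keep the example small.

```typescript
// Hypothetical, synchronous stand-in for the adapter contract; not the library's types.
interface SketchDeps<State, Action> {
  observe: () => State
  act: (action: Action) => void
}

interface CounterState {
  value: number
}
type CounterAction = { type: 'increment' }

// One loop definition, shared by product and eval. Only `deps` varies.
function runUntilTarget(
  deps: SketchDeps<CounterState, CounterAction>,
  target: number,
  maxSteps: number,
): { pass: boolean; steps: number } {
  for (let step = 0; step < maxSteps; step++) {
    const state = deps.observe()
    // Validate and decide: stop as soon as the objective check passes.
    if (state.value >= target) return { pass: true, steps: step }
    deps.act({ type: 'increment' })
  }
  return { pass: deps.observe().value >= target, steps: maxSteps }
}

// Eval wiring: state lives in a fixture instead of a real session.
// A production wiring would implement the same two functions against real storage.
function replayDeps(fixture: CounterState): SketchDeps<CounterState, CounterAction> {
  return {
    observe: () => ({ value: fixture.value }),
    act: () => {
      fixture.value += 1
    },
  }
}

const evalResult = runUntilTarget(replayDeps({ value: 1 }), 3, 10)
const budgetLimited = runUntilTarget(replayDeps({ value: 1 }), 3, 1)
```

The loop body, the pass condition, and the step budget are identical in both wirings, which is what lets an eval result say something about product behavior.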
+## What the Runtime Guarantees
+- `maxSteps`, `maxWallMs`, and `maxCostUsd` guard runaway loops.
+- repeated-action and no-progress stop policies catch stuck behavior.
+- `actionFailure: 'continue'` records worker failures and lets policy recover.
+- `actionFailure: 'stop'` fails fast for workflows where a failed action should
+  abort.
+- observation, validation, decision, stop-policy, and action failures are
+  returned as structured `runtimeErrors` instead of disappearing.
+- trace sink and `onStep` callback failures are recorded in `runtimeErrors`
+  but do not abort the control loop. Agent progress should not depend on
+  telemetry availability.
+- action-policy preflight belongs before `act()`. Use `evaluateActionPolicy`
+  to block or label side effects, budget breaches, and missing expected
+  outcomes before any irreversible action runs.
+- when a `TraceStore` is supplied, the runtime emits:
+  - one run
+  - one tool span per control step
+  - one judge span per eval result
+  - budget ledger entries for recorded spend
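To show what the budget, repeated-action, and no-progress guards are checking for, here is a small reimplementation written for illustration. It is not the runtime's code: `stopReason`, `StepRecord`, and `SketchLimits` are hypothetical, "progress" is assumed to mean a drop in the number of failed evals, and the runtime's own definitions may differ.

```typescript
// Hypothetical reimplementation for illustration; the runtime's own logic may differ.
interface StepRecord {
  actionKey: string // e.g. a serialized action
  failedEvalCount: number
  costUsd: number
}

interface SketchLimits {
  maxSteps: number
  maxCostUsd: number
  maxNoProgressSteps: number
  maxRepeatedActions: number
}

// Count how many consecutive trailing steps satisfy `linked(previous, current)`.
function trailingRun(
  steps: StepRecord[],
  linked: (previous: StepRecord, current: StepRecord) => boolean,
): number {
  let run = 0
  for (let i = steps.length - 1; i > 0; i--) {
    if (!linked(steps[i - 1], steps[i])) break
    run += 1
  }
  return run
}

function stopReason(steps: StepRecord[], limits: SketchLimits): string | undefined {
  if (steps.length >= limits.maxSteps) return 'max-steps'
  let spent = 0
  for (const step of steps) spent += step.costUsd
  if (spent > limits.maxCostUsd) return 'max-cost'
  // A run of k identical actions contains k - 1 adjacent equal pairs.
  const repeats = steps.length > 0 ? trailingRun(steps, (a, b) => a.actionKey === b.actionKey) + 1 : 0
  if (repeats >= limits.maxRepeatedActions) return 'repeated-action'
  // Assumption: a step makes progress only if it reduces the failed-eval count.
  const stalledSteps = trailingRun(steps, (a, b) => b.failedEvalCount >= a.failedEvalCount)
  if (stalledSteps >= limits.maxNoProgressSteps) return 'no-progress'
  return undefined
}

const limits: SketchLimits = { maxSteps: 6, maxCostUsd: 1.5, maxNoProgressSteps: 2, maxRepeatedActions: 3 }

const repeatedRun = stopReason(
  [
    { actionKey: 'revise', failedEvalCount: 2, costUsd: 0.1 },
    { actionKey: 'revise', failedEvalCount: 2, costUsd: 0.1 },
    { actionKey: 'revise', failedEvalCount: 2, costUsd: 0.1 },
  ],
  limits,
)
const stalledRun = stopReason(
  [
    { actionKey: 'patch', failedEvalCount: 2, costUsd: 0.1 },
    { actionKey: 'run_tests', failedEvalCount: 2, costUsd: 0.1 },
    { actionKey: 'review_diff', failedEvalCount: 2, costUsd: 0.1 },
  ],
  limits,
)
const improvingRun = stopReason(
  [
    { actionKey: 'patch', failedEvalCount: 3, costUsd: 0.1 },
    { actionKey: 'run_tests', failedEvalCount: 2, costUsd: 0.1 },
  ],
  limits,
)
```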
+## Propose / Review Preset
+`runProposeReviewAsControlLoop` adapts the common artifact-refinement loop onto
+the generic runtime:
+```text
+propose -> verify -> review -> propose again
+```
+Use it when the task is naturally "produce or improve state until verification
+passes."
+```ts
+import { runProposeReviewAsControlLoop } from '@tangle-network/agent-eval'
+const report = await runProposeReviewAsControlLoop({
+  goal: 'Make the implementation pass tests and satisfy the reviewer.',
+  initialState: { workdir },
+  maxShots: 5,
+  async propose({ state, priorReview }) {
+    return await codingAgent.patch({
+      workdir: state.workdir,
+      instruction: priorReview?.nextShotInstruction,
+    })
+  },
+  async verify(state) {
+    const tests = await runTests(state.workdir)
+    return {
+      pass: tests.ok,
+      score: tests.ok ? 1 : 0,
+      failingLayers: tests.ok ? [] : ['tests'],
+      details: tests,
+    }
+  },
+  async review({ verification }) {
+    return await reviewer.explainNextShot(verification)
+  },
+  failureClassFromVerification(verification) {
+    if (verification.failingLayers?.includes('tests')) return 'sandbox_failure'
+    return 'unknown'
+  },
+})
+```
+Long term, `runProposeReview` should remain the stable convenience API, while
+its internals can route through this control-loop preset.
+## Domain Patterns
+These examples show what belongs in product repos. They should not become core
+`agent-eval` adapters until the same adapter shape is reused by multiple
+products.
+### Shared Sandbox Execution
+Harnesses and judges can run against the same sandbox. The common pattern
+is to pass one sandbox driver and one workdir through every layer:
+```ts
+const driver = new SubprocessSandboxDriver({ cwd: workdir, env })
+const harness = new SandboxHarness(driver)
+```
+Use the same sandbox when checks need shared state:
+- install dependencies once, then typecheck/build/test in the same workdir
+- run a browser/computer-use scenario against the app the harness just started
+- let a judge inspect files, logs, screenshots, or traces produced by earlier
+  layers
+Use separate sandboxes when checks need isolation:
+- variants are running in parallel
+- a test mutates global state
+- credentials or network access differ by phase
+- one action can corrupt the workdir for later checks
+The important rule is explicit ownership: one driver/workdir means shared state;
+multiple drivers/workdirs means isolated state. Do not rely on hidden global
+state.
+### Coding Agent
+```ts
+interface CodingState {
+  workdir: string
+  diffSummary: string
+  tests: { typecheck: boolean; unit: boolean; e2e?: boolean }
+  generatedFiles: string[]
+  runtimeTrace?: string
+  reviewerFindings: string[]
+}
+```
+```ts
+type CodingAction =
+  | { type: 'patch'; instruction: string }
+  | { type: 'run_tests'; command: string }
+  | { type: 'review_diff' }
+  | { type: 'ask_user'; question: string }
+```
+Validators:
+- expected files exist
+- typecheck/build/tests pass
+- generated app or agent completes a representative runtime scenario
+- no hardcoded fake success or placeholder integrations
+- reviewer findings resolved or explicitly accepted
+Stop policy:
+- stop when build and runtime validators pass
+- stop on no progress after repeated patch/test cycles
+- ask user when task intent or credentials are missing
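The validator list above could be written as a pure function over `CodingState`, roughly as sketched here. `CodingState` is copied from the interface above; `SketchEval` and `validateCoding` are hypothetical, with `SketchEval` loosely modeled on the fields passed to `objectiveEval` in the Core API example. The runtime-scenario and fake-success checks are omitted because they need more than the typed state.

```typescript
// Copied from the CodingState interface above.
interface CodingState {
  workdir: string
  diffSummary: string
  tests: { typecheck: boolean; unit: boolean; e2e?: boolean }
  generatedFiles: string[]
  runtimeTrace?: string
  reviewerFindings: string[]
}

// Hypothetical eval shape, loosely modeled on the objectiveEval arguments.
interface SketchEval {
  id: string
  passed: boolean
  severity: 'critical' | 'warning'
}

function validateCoding(state: CodingState, expectedFiles: string[]): SketchEval[] {
  const missing = expectedFiles.filter((file) => state.generatedFiles.indexOf(file) === -1)
  return [
    { id: 'expected-files-exist', passed: missing.length === 0, severity: 'critical' },
    { id: 'typecheck-passes', passed: state.tests.typecheck, severity: 'critical' },
    { id: 'unit-tests-pass', passed: state.tests.unit, severity: 'critical' },
    // e2e is optional in the state: treat "not run" as not passed
    // rather than letting a missing result count as success.
    { id: 'e2e-passes', passed: state.tests.e2e === true, severity: 'warning' },
    { id: 'reviewer-findings-resolved', passed: state.reviewerFindings.length === 0, severity: 'warning' },
  ]
}

const codingEvals = validateCoding(
  {
    workdir: '/tmp/app',
    diffSummary: 'add fetch client',
    tests: { typecheck: true, unit: false },
    generatedFiles: ['src/index.ts'],
    reviewerFindings: [],
  },
  ['src/index.ts', 'src/client.ts'],
)
const failedCodingIds = codingEvals.filter((result) => !result.passed).map((result) => result.id)
```

For this state the failing ids are the missing file, the unit tests, and the e2e run that never happened.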
+### Browser / Computer-Use Agent
+Use this shape when the agent controls a browser, desktop session, or remote
+computer and needs to complete a task end-to-end.
+```ts
+interface ComputerUseState {
+  url?: string
+  goal: string
+  screenshot?: string
+  accessibilityTree?: unknown
+  completedSteps: string[]
+  openIssues: string[]
+  assertions: Array<{ id: string; passed: boolean; detail?: string }>
+}
+```
+```ts
+type ComputerUseAction =
+  | { type: 'navigate'; url: string }
+  | { type: 'click'; selectorOrDescription: string }
+  | { type: 'type'; selectorOrDescription: string; text: string }
+  | { type: 'inspect' }
+  | { type: 'ask_user'; question: string }
+```
+Validators:
+- required page or app state is reached
+- no blocking errors are visible
+- expected text, data, or UI state is present
+- screenshots or accessibility tree support the claimed success
+- repeated clicks or navigation loops are detected
+Stop policy:
+- stop when objective UI assertions pass
+- stop on repeated action or no-progress policies
+- ask user when credentials, permissions, or ambiguous choices block progress
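One possible `decide` function for this stop policy is sketched below. The state and action types are trimmed copies of the ones above, and `SketchDecision` mirrors the `stop` / `continue` shapes returned by `decide` in the Core API example. The rule that any open issue means "ask the user" is an assumption made for the sketch; a real adapter would distinguish blocking issues from ordinary ones.

```typescript
// Trimmed copies of the ComputerUseState and ComputerUseAction types above.
interface ComputerUseState {
  goal: string
  openIssues: string[]
  assertions: Array<{ id: string; passed: boolean; detail?: string }>
}
type ComputerUseAction =
  | { type: 'inspect' }
  | { type: 'ask_user'; question: string }

// Mirrors the decision shapes used by `decide` in the Core API example.
type SketchDecision =
  | { type: 'stop'; pass: boolean; reason: string }
  | { type: 'continue'; action: ComputerUseAction; reason: string }

function decideComputerUse(state: ComputerUseState): SketchDecision {
  const failed = state.assertions.filter((assertion) => !assertion.passed)
  // Require at least one assertion so an empty list cannot count as success.
  if (state.assertions.length > 0 && failed.length === 0) {
    return { type: 'stop', pass: true, reason: 'all UI assertions passed' }
  }
  // Assumption: every open issue is something only a human can unblock.
  if (state.openIssues.length > 0) {
    return {
      type: 'continue',
      action: { type: 'ask_user', question: 'How should I resolve: ' + state.openIssues[0] + '?' },
      reason: 'blocked on user input',
    }
  }
  return {
    type: 'continue',
    action: { type: 'inspect' },
    reason: 'unmet assertions: ' + failed.map((assertion) => assertion.id).join(', '),
  }
}

const finished = decideComputerUse({
  goal: 'submit the form',
  openIssues: [],
  assertions: [{ id: 'confirmation-visible', passed: true }],
})
const blocked = decideComputerUse({
  goal: 'submit the form',
  openIssues: ['login required'],
  assertions: [{ id: 'confirmation-visible', passed: false }],
})
```

Repeated-click and navigation-loop detection is left to the runtime's repeated-action and no-progress policies rather than duplicated in `decide`.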
+### Research / Documentation Agent
+Use this shape when the agent produces a brief, explanation, migration guide, or
+technical research note.
+```ts
+interface ResearchState {
+  question: string
+  draft: string
+  sources: Array<{ url: string; title?: string; relevant: boolean }>
+  unsupportedClaims: string[]
+  reviewerFindings: string[]
+}
+```
+```ts
+type ResearchAction =
+  | { type: 'search'; query: string }
+  | { type: 'read_source'; url: string }
+  | { type: 'revise_draft'; failures: string[] }
+  | { type: 'ask_user'; question: string }
+```
+Validators:
+- every important claim has a source
+- sources are relevant and current enough for the task
+- unsupported claims are removed or marked as uncertain
+- reviewer findings are resolved
+- final output answers the original question
+Stop policy:
+- stop when source coverage and reviewer checks pass
+- ask user when the question scope is ambiguous
+- stop on repeated research queries with no new evidence
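The research validators can be sketched the same way, as a pure function over `ResearchState`. `ResearchState` is copied from the interface above; `ResearchEval` and `validateResearch` are hypothetical, and the minimum of two relevant sources is an assumed threshold chosen for the example. Checks that need judgment, such as whether the draft answers the question, are left to a subjective judge.

```typescript
// Copied from the ResearchState interface above.
interface ResearchState {
  question: string
  draft: string
  sources: Array<{ url: string; title?: string; relevant: boolean }>
  unsupportedClaims: string[]
  reviewerFindings: string[]
}

// Hypothetical eval shape, loosely modeled on the objectiveEval arguments.
interface ResearchEval {
  id: string
  passed: boolean
  severity: 'critical' | 'warning'
}

function validateResearch(state: ResearchState, minRelevantSources = 2): ResearchEval[] {
  const relevant = state.sources.filter((source) => source.relevant)
  return [
    { id: 'draft-not-empty', passed: state.draft.trim().length > 0, severity: 'critical' },
    { id: 'enough-relevant-sources', passed: relevant.length >= minRelevantSources, severity: 'critical' },
    { id: 'no-unsupported-claims', passed: state.unsupportedClaims.length === 0, severity: 'critical' },
    { id: 'reviewer-findings-resolved', passed: state.reviewerFindings.length === 0, severity: 'warning' },
  ]
}

const researchEvals = validateResearch({
  question: 'Which sandbox driver should CI use?',
  draft: 'Use the subprocess driver for local runs.',
  sources: [
    { url: 'https://example.com/a', relevant: true },
    { url: 'https://example.com/b', relevant: false },
  ],
  unsupportedClaims: ['containers are always slower'],
  reviewerFindings: [],
})
const failedResearchIds = researchEvals.filter((result) => !result.passed).map((result) => result.id)
```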
+## Integration Checklist
+For a new downstream integration:
+1. Define typed state.
+2. Define domain actions.
+3. Write objective validators first.
+4. Add subjective judges only for judgment-heavy dimensions.
+5. Decide which actions require approval before execution.
+6. Add cost extraction for expensive actions.
+7. Add no-progress and repeated-action policies.
+8. Emit to a `TraceStore` in CI and production-like evals.
+9. Keep the adapter downstream until it proves reusable.