
@tangle-network/agent-eval 0.17.2 → 0.18.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

package/README.md +24 -16
package/dist/index.d.ts +271 -75
package/dist/index.js +393 -16
package/dist/index.js.map +1 -1
package/docs/concepts.md +155 -0
package/docs/control-runtime.md +351 -0
package/docs/feature-guide.md +213 -0
package/docs/feedback-trajectories.md +193 -0
package/docs/multi-shot-optimization.md +122 -0
package/docs/wire-protocol.md +199 -0
package/package.json +21 -14

package/docs/feature-guide.md ADDED

@@ -0,0 +1,213 @@
+# Feature Guide
+This page explains the main `agent-eval` primitives in plain English first,
+then shows when to use each one.
+## ELI5
+LLM agents can write code, drafts, research, plans, and actions. The hard part
+is knowing whether they actually did a good job, whether they should keep
+trying, and whether a change made them better or worse.
+`agent-eval` gives you reusable tools for that:
+- **Judges** grade one output.
+- **Verifiers** run several checks in order.
+- **Control loops** let an agent keep working until it passes, gets blocked, or
+  hits a budget.
+- **Feedback trajectories** turn normal user approvals/rejections into training
+  and eval data.
+- **Datasets and holdouts** keep examples organized so you do not overfit.
+- **Optimizers and mutation loops** try prompt/signature/code variants and keep
+  the ones that really improve.
+- **Traces and telemetry** show what happened, step by step.
+## Which Primitive Should I Use?
+| Problem | Use | Why |
+| --- | --- | --- |
+| “Did this single answer/draft pass?” | Judge or rubric | Fast quality signal for one artifact. |
+| “Does generated code actually work?” | `BuilderSession`, `MultiLayerVerifier`, sandbox harness | Build/test/runtime gates catch failures judges miss. |
+| “Should the agent keep trying?” | `runAgentControlLoop` | Budgeted `observe -> validate -> decide -> act` runtime. |
+| “The agent should propose, verify, review, and revise.” | `runProposeReviewAsControlLoop` | Reusable preset over the generic control loop. |
+| “Human feedback should become reusable eval data.” | `FeedbackTrajectory` | Captures approvals, rejections, edits, choices, metrics, and policy blocks. |
+| “Can this action run, or does it need approval?” | `evaluateActionPolicy` | Generic preflight for side effects, budgets, and required evidence. |
+| “I need train/dev/test/holdout examples.” | `Dataset` plus feedback trajectory conversion | Stable splits and contamination control. |
+| “Which prompt or signature wins?” | `PromptOptimizer`, `OptimizationLoop`, steering optimizers | Runs variants on scenarios and compares scores. |
+| “Improve a multi-turn agent over real task traces.” | `runMultiShotOptimization` | GEPA-style trajectory optimization with actionable side information (ASI) and held-out promotion. |
+| “Improve prompts, then code if prompts plateau.” | `runPromptEvolution`, composite mutator, code mutator | Bounded evolution with telemetry and lineage. |
+| “Find why a regression happened.” | bisector, traces, run records | Narrows changes and preserves evidence. |
+| “Expose evals to another language.” | Wire protocol and Python client | HTTP/RPC boundary for non-TypeScript apps. |
+## Integration Patterns
+### Recommended Agent Product Shape
+Use this shape when the product needs to keep pushing work forward instead of
+only answering once:
+```text
+user intent
+  -> product adapter observes typed state
+  -> runAgentControlLoop decides the next driver action
+  -> worker executes in the real product environment
+  -> validators and judges grade the new state
+  -> FeedbackTrajectory records user/environment/reviewer signal
+  -> datasets and optimizers replay the same adapter
+```
+The important part is that production and eval do not use different loops. The
+adapter can swap dependencies (real user session, replay fixture, sandbox), but
+the state shape, validators, actions, budgets, and stop policies should stay
+the same. That is what makes benchmark gains transfer to real usage.
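A minimal sketch of the idea above, assuming a hypothetical `Adapter` interface and `runLoopSketch` function (neither is the real `runAgentControlLoop` API). It is synchronous for brevity; a real adapter would be async. The loop itself never changes, so a production adapter and a replay fixture can share the same validators, actions, and budget.

```typescript
// Hypothetical shapes for illustration only; not the agent-eval API.
type Decision<A> = { kind: 'act'; action: A } | { kind: 'stop'; reason: string }

interface Adapter<S, A> {
  observe(): S
  validate(state: S): boolean
  decide(state: S): Decision<A>
  act(action: A): void
}

interface LoopResult {
  status: 'passed' | 'stopped' | 'budget_exhausted'
  steps: number
  reason?: string
}

// observe -> validate -> decide -> act, bounded by a step budget.
function runLoopSketch<S, A>(adapter: Adapter<S, A>, budget: { maxSteps: number }): LoopResult {
  let steps = 0
  while (true) {
    const state = adapter.observe()
    if (adapter.validate(state)) return { status: 'passed', steps }
    if (steps >= budget.maxSteps) return { status: 'budget_exhausted', steps }
    const decision = adapter.decide(state)
    if (decision.kind === 'stop') return { status: 'stopped', steps, reason: decision.reason }
    adapter.act(decision.action)
    steps += 1
  }
}

// Toy adapter standing in for a real product session or a replay fixture.
function counterAdapter(target: number): Adapter<number, 'increment'> {
  let value = 0
  return {
    observe: () => value,
    validate: (state) => state >= target,
    decide: () => ({ kind: 'act', action: 'increment' }),
    act: () => { value += 1 },
  }
}

console.log(runLoopSketch(counterAdapter(2), { maxSteps: 5 }).status)
console.log(runLoopSketch(counterAdapter(10), { maxSteps: 3 }).status)
```

Swapping `counterAdapter` for a sandbox or replay adapter changes the dependencies but not the loop, which is the property the paragraph above relies on.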
+### Agent Runtime Integration
+Use when you have a coding, browser, computer-use, research, or documentation agent.
+1. Represent the current task state.
+2. Validate that state with objective checks first and judges second.
+3. Use `runAgentControlLoop` to decide the next action.
+4. Record user feedback as `FeedbackTrajectory`.
+5. Convert trajectories into datasets and optimizer rows.
+Result:
+```text
+normal agent usage -> labeled examples -> replay/eval -> optimization
+```
+Implementation ownership:
+- Put reusable loop mechanics, trace schemas, budget accounting, split
+  assignment, and optimizer row conversion in `agent-eval`.
+- Put product state readers, action executors, approval policy, credentials,
+  workspace paths, and UI-specific storage in the downstream repo.
+- Promote a product adapter into `agent-eval` only after at least two products
+  need the same adapter shape.
+### Code Generator
+Use when an agent writes or patches a repo.
+1. Use `BuilderSession` or `MultiLayerVerifier`.
+2. Always run static gates like typecheck/build/tests.
+3. Add semantic judges only after build gates pass.
+4. Store traces and run records for regression debugging.
+Result:
+```text
+generated code -> build/test/runtime gates -> score -> ship or revise
+```
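The ordering rule in the steps above (static gates first, semantic judges only after gates pass) can be sketched as follows. `Check`, `Verdict`, and `verifyInLayers` are hypothetical names for illustration and are not the `MultiLayerVerifier` or `BuilderSession` API.

```typescript
// Hypothetical shapes for illustration only; not the agent-eval API.
interface CheckResult { name: string; pass: boolean; detail?: string }
type Check = () => CheckResult

interface Verdict { pass: boolean; failedAt?: string; results: CheckResult[] }

// Run gates before judges and stop at the first failure, so a semantic
// judge can never override (or even run after) a failed build or test gate.
function verifyInLayers(gates: Check[], judges: Check[]): Verdict {
  const results: CheckResult[] = []
  for (const check of [...gates, ...judges]) {
    const result = check()
    results.push(result)
    if (!result.pass) return { pass: false, failedAt: result.name, results }
  }
  return { pass: true, results }
}

let judgeCalls = 0
const verdict = verifyInLayers(
  [
    () => ({ name: 'typecheck', pass: true }),
    () => ({ name: 'tests', pass: false, detail: '2 failing' }),
  ],
  [() => { judgeCalls += 1; return { name: 'semantic-judge', pass: true } }],
)
console.log(verdict.failedAt, judgeCalls) // tests 0
```

Short-circuiting also saves judge cost on artifacts that cannot ship anyway.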
+### Prompt/Signature Optimizer
+Use when you want Ax/GEPA-style improvement.
+1. For variable-length agent tasks, use `runMultiShotOptimization`.
+2. Build search/dev/test/holdout splits from the real product loop.
+3. Score full trajectories, not just final text.
+4. Emit actionable side information for failures the mutator can fix.
+5. Promote only `promotedVariant`, never a rejected `searchBestVariant`.
+6. Keep run records with prompt hash, model, config, cost, and commit.
+Result:
+```text
+candidate variant -> repeated evals -> statistical comparison -> promotion gate
+```
+Do not optimize a toy harness if users run a different product loop. Build
+training examples from `FeedbackTrajectory` and `controlRunToFeedbackTrajectory`,
+then replay them through the same product adapter used at runtime. Train on
+`train`, tune on `dev`, report on `test`, and keep `holdout` untouched until a
+promotion decision.
+### Human Feedback Data
+Use when operator or reviewer interaction should create labels.
+Capture:
+- approve/reject
+- select A/B/C
+- edit/rewrite
+- rank/rate
+- comment
+- metric outcome
+- policy block or budget block
+Store as `FeedbackTrajectory`, then derive:
+- preference memory for the next run
+- dataset scenarios for regression
+- optimizer rows for prompt/signature/code changes
+- holdout examples to detect overfitting
+## Feature Map
+| Area | Key exports | Best for | Notes |
+| --- | --- | --- | --- |
+| Judging | `createCustomJudge`, `createAntiSlopJudge`, wire rubrics | Content, voice, semantic quality | Pair with objective checks when possible. |
+| Verification | `MultiLayerVerifier`, `JudgeRunner`, sandbox harness | Code and multi-step gates | Do not let semantic judges override failed builds. |
+| Control | `runAgentControlLoop`, `objectiveEval`, `subjectiveEval` | Long-running agent tasks | Supports budgets, cost, stop policies, trace spans. |
+| Propose/review | `runProposeReview`, `runProposeReviewAsControlLoop` | Iterative artifact repair | Good for code, docs, plans, briefs. |
+| Feedback data | `FeedbackTrajectory`, stores, converters | Human/environment labels | Domain adapters live in downstream repos. |
+| Action policy | `evaluateActionPolicy` | Approval/budget preflight | Blocks or labels actions before `act()`. |
+| Datasets | `Dataset`, holdout tools, canaries | Train/dev/test/holdout corpora | Keeps optimization honest. |
+| Optimization | `PromptOptimizer`, `OptimizationLoop`, steering optimizers | Prompt/signature comparison | Use held-out gates before promotion. |
+| Evolution | prompt/code mutators, sandbox pool, telemetry | Autoresearch and mutation loops | Use budgets and lineage; do not run unbounded. |
+| Telemetry | `TraceStore`, OTLP, file sinks | Audit and replay | Treat traces as evidence, not just logs. |
+| Reporting | summaries, pareto, cost tracker | Decision support | Useful for PRs, launch gates, research notes. |
+## Guardrails
+- Prefer deterministic checks before LLM judges.
+- Keep holdout data out of optimization.
+- Record model, prompt hash, config hash, commit, and cost for every serious
+  run.
+- Budget every loop: steps, wall time, and dollars.
+- Treat external side effects as downstream policy. The runtime can stop loops,
+  but your adapter decides what requires approval.
+- Store user feedback with enough context to replay it later.
+- Run harnesses and judges in the same sandbox when they need shared files,
+  logs, screenshots, or browser state. Use separate sandboxes for parallel
+  variants or destructive checks.
+- Treat telemetry as evidence, not control flow. A trace sink outage should be
+  visible in `runtimeErrors`, but it should not stop the worker from completing
+  the user task.
+## Same-Sandbox Example
+`examples/same-sandbox-harness/` shows the common coding/browser pattern:
+```text
+one sandbox/workdir -> install/build/test -> inspect evidence -> emit judge span
+```
+Use this when a judge needs evidence produced by earlier harness phases. Use
+isolated sandboxes when variants run in parallel or a phase can corrupt the
+workspace.
+## Highest-ROI Adoption Order
+1. Wrap one real product workflow in `runAgentControlLoop`.
+2. Add objective validators that can fail without an LLM.
+3. Emit traces and convert completed runs into feedback trajectories.
+4. Capture explicit user/reviewer feedback on attempts.
+5. Replay those trajectories as train/dev/test/holdout scenarios.
+6. Run prompt/signature optimization against that replay adapter.
+7. Promote only when held-out trajectories and product telemetry both improve.
+## What Stays Out Of Core
+Domain-specific adapters should usually stay in downstream repos until they prove
+reusable:
+- browser site-specific actions
+- repo-specific coding commands
+- workspace-specific storage paths
+Core should provide shapes, stores, runners, scoring, traces, and converters.
+Downstream integrations provide domain state, policy, tools, and storage.

package/docs/feedback-trajectories.md ADDED

@@ -0,0 +1,193 @@
+# Feedback Trajectories
+Feedback trajectories are the generic shape behind feedback-driven learning
+loops:
+```text
+candidate artifact/action -> user/judge/environment feedback -> revision chain -> labeled example -> replay/eval/optimization
+```
+They are deliberately domain-neutral. Browser task completion, code patch
+review, and research brief revision all fit the same structure.
+## Core Shape
+```ts
+import {
+  createFeedbackTrajectory,
+  summarizePreferenceMemory,
+  feedbackTrajectoriesToDatasetScenarios,
+  feedbackTrajectoriesToOptimizerRows,
+} from '@tangle-network/agent-eval'
+const trajectory = createFeedbackTrajectory({
+  projectId: 'research-agent',
+  scenarioId: 'brief-review',
+  task: {
+    intent: 'Revise a research brief until it is specific and sourced.',
+    context: { audience: 'technical reviewer' },
+  },
+  attempts: [
+    {
+      id: 'attempt-1',
+      stepIndex: 0,
+      artifactType: 'research',
+      artifact: { summary: 'Initial brief with weak sourcing.' },
+      createdAt: new Date().toISOString(),
+    },
+  ],
+  labels: [
+    {
+      source: 'user',
+      kind: 'revision_request',
+      value: 'needs stronger evidence',
+      reason: 'add primary sources and remove unsupported claims',
+      severity: 'error',
+      createdAt: new Date().toISOString(),
+    },
+  ],
+})
+const memory = summarizePreferenceMemory([trajectory])
+const scenarios = feedbackTrajectoriesToDatasetScenarios([trajectory])
+const optimizerRows = feedbackTrajectoriesToOptimizerRows([trajectory])
+```
+## What Belongs In Core
+`agent-eval` owns the substrate:
+- trajectory and label schemas
+- in-memory and JSONL-backed stores
+- deterministic train/dev/test/holdout splitting
+- JSONL import/export
+- conversion into `DatasetScenario`
+- conversion into optimizer rows
+- preference-memory distillation
+- conversion from `runAgentControlLoop` results
+Downstream repos own domain adapters:
+- how review actions map to labels
+- how generated artifacts are represented
+- which side effects require approval
+- which budgets and metrics matter
+- where task-local data is stored
+## Label Sources
+Labels can come from multiple places:
+| Source | Example |
+| --- | --- |
+| `user` | Approved a draft, rejected a draft, selected option B. |
+| `judge` | LLM/domain judge scores usefulness or voice. |
+| `environment` | Browser task completed, tests passed, API call succeeded. |
+| `metric` | A measured outcome improved or regressed. |
+| `policy` | Budget cap blocked execution, approval required. |
+| `system` | Control loop passed or failed. |
+The most useful trajectories combine several label sources: user preference
+plus objective outcome plus later metric result.
+## Multi-Shot Revision
+Store every attempt, not only the final artifact:
+```text
+draft 1 -> user rejects with reason -> draft 2 -> judge passes -> user approves -> metric outcome
+```
+This supports tests that ask "does the agent improve after feedback?" instead
+of only "was the final answer good?"
+## Control Runtime Bridge
+`controlRunToFeedbackTrajectory` turns a finished control-loop run into a
+trajectory:
+```ts
+import { runAgentControlLoop, controlRunToFeedbackTrajectory } from '@tangle-network/agent-eval'
+const run = await runAgentControlLoop({ ... })
+const trajectory = controlRunToFeedbackTrajectory(run, {
+  projectId: 'coding-agent',
+  scenarioId: 'fix-typecheck',
+  artifactType: 'code',
+})
+```
+Use this for tasks where the agent works autonomously and the labels come from
+validators, policies, or environment outcomes. Use direct trajectory recording
+for review-heavy workflows where a person approves, rejects, edits, ranks, or
+comments.
+## Optimization Loop
+The same stored trajectories can feed three layers:
+1. **Immediate memory**: distill labels into short instructions.
+2. **Replay/eval**: convert trajectories into dataset scenarios.
+3. **Prompt/signature/code optimization**: convert trajectories into optimizer
+   rows and evaluate candidate variants on train/dev/holdout splits.
+That is the reusable pattern:
+```text
+normal agent usage -> labeled trajectory -> eval dataset -> optimizer input -> replay against held-out feedback
+```
+## Replay Adapter
+Use `replayFeedbackTrajectory` when a stored trajectory should be tested against
+a new prompt, signature, policy, or code path:
+```ts
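+import { replayFeedbackTrajectory } from '@tangle-network/agent-eval'
+// `runCandidateOn` is your own integration code, not an agent-eval export.
+const result = await replayFeedbackTrajectory(trajectory, {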
+  async replay(item) {
+    const run = await runCandidateOn(item.task, item.attempts)
+    return {
+      pass: run.pass,
+      score: run.score,
+      labels: run.pass ? [] : [{
+        source: 'environment',
+        kind: 'reject',
+        value: false,
+        reason: run.reason,
+        severity: 'error',
+        createdAt: new Date().toISOString(),
+      }],
+      outcome: { success: run.pass, score: run.score, detail: run.reason },
+    }
+  },
+})
+```
+Replay adapters live downstream because only the integration knows how to
+re-run a browser task, coding patch, or research brief.
+## Split Discipline
+Treat feedback data like product analytics with labels:
+- `train`: examples the optimizer can directly learn from.
+- `dev`: examples used to choose among candidate variants.
+- `test`: examples used for honest reporting after a variant is chosen.
+- `holdout`: examples kept untouched until promotion or release review.
+Do not let an optimizer see `test` or `holdout` examples through prompt text,
+preference memory, few-shot examples, or manual tuning. If a trajectory becomes
+part of memory, mark which split it came from and keep that memory out of
+held-out evaluation.
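One way to keep splits stable is to derive them from a hash of the trajectory id rather than from random sampling. The sketch below is an assumption-laden illustration: `fnv1a`, `assignSplit`, `optimizerVisible`, and the 70/10/10/10 ratios are hypothetical and do not describe how `agent-eval` assigns splits internally.

```typescript
type Split = 'train' | 'dev' | 'test' | 'holdout'

// FNV-1a (32-bit) over UTF-16 code units: stable across runs and machines.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

// Assumed ratios for illustration: 70% train, 10% dev, 10% test, 10% holdout.
function assignSplit(trajectoryId: string): Split {
  const bucket = fnv1a(trajectoryId) % 100
  if (bucket < 70) return 'train'
  if (bucket < 80) return 'dev'
  if (bucket < 90) return 'test'
  return 'holdout'
}

// Only train and dev examples may reach prompts, memory, or few-shot pools.
function optimizerVisible<T extends { id: string }>(items: T[]): T[] {
  return items.filter((item) => {
    const split = assignSplit(item.id)
    return split === 'train' || split === 'dev'
  })
}

const sample = ['traj-1', 'traj-2', 'traj-3'].map((id) => ({ id, split: assignSplit(id) }))
console.log(sample)
```

Because the split depends only on the id, a trajectory never migrates from `holdout` into `train` when the dataset grows or is re-exported.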
+## What To Store
+A useful trajectory has enough information to replay the decision later:
+- user intent and relevant context
+- every attempted artifact or action
+- objective validation results
+- user/reviewer/environment labels with reasons
+- measured outcome when one exists
+- model, prompt/config hash, code commit, and cost in metadata
+If the record cannot answer "what did the agent try, why was it judged wrong,
+and what changed next?", it is not yet useful training data.

package/docs/multi-shot-optimization.md ADDED

@@ -0,0 +1,122 @@
+# Multi-Shot Optimization
+`runMultiShotOptimization` is the public adapter for GEPA-style optimization over
+variable-length agent conversations.
+Use it when the thing you want to improve is not a single model call. Typical
+targets are agent system prompts, tool descriptions, routing policies, retrieval
+plans, or app-specific scaffolding that affects an entire task trajectory.
+The primitive is intentionally small. Your app owns the domain logic:
+- `seedVariants`: prompt/config/tool-policy candidates
+- `runner`: executes one complete task trajectory for one variant
+- `scorer`: scores the trajectory and emits actionable side information
+- `mutateAdapter`: proposes new variants from top and bottom trials
+`agent-eval` owns the release-critical glue:
+- stable paired seeds
+- search-split prompt evolution
+- cost/score Pareto objectives
+- failed-run conversion into failed trials
+- actionable side information (ASI) projection into reflection traces and numeric metrics
+- optional paired holdout gating through `HeldOutGate`
+- validated `RunRecord` rows for promotion evidence
+## Result Contract
+The return shape separates discovery from promotion:
+- `searchBestVariant`: best variant on the optimizer-visible search scenarios
+- `searchBestAggregate`: aggregate for that search winner
+- `promotedVariant`: variant callers should ship
+- `promotedAggregate`: aggregate for the promoted variant
+- `gate`: holdout decision and evidence, or `null` when no gate ran
+If a holdout gate is configured and rejects the search winner,
+`promotedVariant` is the baseline. Do not ship `searchBestVariant` directly
+unless you intentionally run without a holdout gate.
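The promotion rule can be read as a small decision function. This is a sketch under stated assumptions: `Variant`, `GateDecision`, and `selectPromoted` are hypothetical names, and the behavior when no gate ran (returning the search winner) is inferred from the paragraph above rather than confirmed by the source.

```typescript
// Hypothetical shapes for illustration only; not the agent-eval result types.
interface Variant { id: string }
interface GateDecision { promote: boolean; reason: string }

function selectPromoted(baseline: Variant, searchBest: Variant, gate: GateDecision | null): Variant {
  // Assumption: with no gate configured, the caller has accepted the search winner.
  if (gate === null) return searchBest
  // A rejecting gate falls back to the baseline, never to the search winner.
  return gate.promote ? searchBest : baseline
}

const baselineVariant: Variant = { id: 'baseline' }
const searchWinner: Variant = { id: 'gen3-child2' }

console.log(selectPromoted(baselineVariant, searchWinner, { promote: false, reason: 'holdout regression' }).id) // baseline
console.log(selectPromoted(baselineVariant, searchWinner, { promote: true, reason: 'holdout improved' }).id) // gen3-child2
```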
+## Actionable Side Information
+The scorer should return `asi` rows for concrete failure modes:
+```ts
+{
+  expectationId: 'used-primary-sources',
+  message: 'The final answer cited secondary summaries instead of primary sources.',
+  severity: 'error',
+  responsibleSurface: 'retrieval-policy',
+  suggestion: 'Prefer primary-source domains during source-gathering turns.',
+}
+```
+These rows become:
+- reflection expectations via `trialTraceFromMultiShotTrial`
+- aggregate metrics like `asi.error` and `surface.retrieval-policy`
+- trace evidence available to downstream reports
+This is the main reason to use this primitive instead of reducing each run to a
+single scalar reward.
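To make the projection into aggregate metrics concrete, here is a rough sketch that counts rows per severity and per responsible surface. The `AsiRow` type mirrors the example row above, but the severity values other than `error`, the counting semantics, and the `aggregateAsi` name are assumptions, not the library's implementation.

```typescript
// Field names follow the example row above; the severity union is assumed.
interface AsiRow {
  expectationId: string
  message: string
  severity: 'info' | 'warning' | 'error'
  responsibleSurface: string
  suggestion?: string
}

// Count rows under keys shaped like `asi.error` and `surface.retrieval-policy`.
function aggregateAsi(rows: AsiRow[]): Record<string, number> {
  const metrics: Record<string, number> = {}
  for (const row of rows) {
    for (const key of [`asi.${row.severity}`, `surface.${row.responsibleSurface}`]) {
      metrics[key] = (metrics[key] ?? 0) + 1
    }
  }
  return metrics
}

const rows: AsiRow[] = [
  { expectationId: 'used-primary-sources', message: 'Cited secondary summaries.', severity: 'error', responsibleSurface: 'retrieval-policy' },
  { expectationId: 'stayed-on-topic', message: 'Drifted in turn 4.', severity: 'warning', responsibleSurface: 'system-prompt' },
  { expectationId: 'used-primary-sources', message: 'No sources in final answer.', severity: 'error', responsibleSurface: 'retrieval-policy' },
]

console.log(aggregateAsi(rows))
```

Grouping by surface is what lets a mutator target the component that caused the failure instead of rewriting everything.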
+## Holdout Discipline
+For release gates, configure `gate`. The first seed variant is the baseline and
+`gate.gate.baselineKey` must match its id.
+Holdout scenarios must be disjoint from `searchScenarioIds`. The adapter runs
+baseline and candidate with the same `(scenarioId, rep)` seed, validates every
+row with `validateRunRecord`, then asks `HeldOutGate` whether to promote.
+When `gate.searchScenarioIds` is omitted, the adapter reuses
+`searchScenarioIds` for the overfit-gap check.
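The two invariants above, a seed shared by baseline and candidate and disjoint search and holdout sets, can be sketched like this. `pairedSeed` and `assertDisjoint` are hypothetical helpers and the hash choice is an assumption; the adapter's actual seeding scheme is not described in the source.

```typescript
// The seed depends only on (scenarioId, rep), never on the variant, so the
// baseline and the candidate see identical randomness for each paired run.
function pairedSeed(scenarioId: string, rep: number): number {
  const text = `${scenarioId}#${rep}`
  let hash = 0x811c9dc5 // FNV-1a 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

// Fail fast if any holdout scenario is also visible to the optimizer.
function assertDisjoint(searchScenarioIds: string[], holdoutScenarioIds: string[]): void {
  const search = new Set(searchScenarioIds)
  const overlap = holdoutScenarioIds.filter((id) => search.has(id))
  if (overlap.length > 0) {
    throw new Error(`holdout overlaps search scenarios: ${overlap.join(', ')}`)
  }
}

assertDisjoint(['s1', 's2'], ['h1', 'h2'])
console.log(pairedSeed('h1', 0) === pairedSeed('h1', 0)) // true
```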
+## Minimal Shape
+```ts
+import {
+  runMultiShotOptimization,
+  trialTraceFromMultiShotTrial,
+  type MultiShotVariant,
+} from '@tangle-network/agent-eval'
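+// App-owned placeholders (not agent-eval exports): currentPrompt, searchScenarios,
+// runYourAgentToCompletion, scoreFullTrajectory, proposePromptMutations, deploy.
+type Payload = { systemPrompt: string }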
+const baseline: MultiShotVariant<Payload> = {
+  id: 'baseline',
+  label: 'baseline',
+  generation: 0,
+  payload: { systemPrompt: currentPrompt },
+}
+const result = await runMultiShotOptimization<Payload>({
+  runId: `research-agent-${Date.now()}`,
+  target: 'research-agent-system-prompt',
+  seedVariants: [baseline],
+  searchScenarioIds: searchScenarios.map((s) => s.id),
+  reps: 2,
+  generations: 4,
+  populationSize: 4,
+  scoreConcurrency: 4,
+  runner: {
+    async run({ variant, scenarioId, seed }) {
+      return runYourAgentToCompletion({ scenarioId, seed, prompt: variant.payload.systemPrompt })
+    },
+  },
+  scorer: {
+    async score({ run }) {
+      return scoreFullTrajectory(run.trace)
+    },
+  },
+  mutateAdapter: {
+    async mutate({ parent, bottomTrials, childCount, generation }) {
+      const traces = bottomTrials.map((t) => trialTraceFromMultiShotTrial(t))
+      return proposePromptMutations({ parent, traces, childCount, generation })
+    },
+  },
+})
+deploy(result.promotedVariant.payload)
+```