npm - @tangle-network/agent-eval - Versions diffs - 0.20.10 → 0.20.11 - Mend

@tangle-network/agent-eval 0.20.10 → 0.20.11

Files changed (4)

package/README.md +78 -4
package/dist/openapi.json +1 -1
package/docs/product-eval-adoption.md +194 -0
package/package.json +1 -1

package/README.md CHANGED

@@ -1,10 +1,10 @@
 # @tangle-network/agent-eval
-Trace-first evaluation infrastructure for agent systems.
+Evaluation infrastructure for agent systems.
-`agent-eval` provides the contracts and runtime primitives for measuring agent
-behavior: traces, harnesses, verifier pipelines, judges, datasets, holdout
-gates, failure classification, optimization loops, and release reports.
+`agent-eval` gives agent products a reusable way to record what happened,
+verify outcomes, classify failures, compare variants, optimize prompts or
+policies, and make release decisions from evidence instead of anecdotes.
 It does not own your product state, credentials, UI, or model routing. Product
 teams keep those boundaries; this package standardizes how runs are recorded,
@@ -15,7 +15,9 @@ checked, compared, and promoted.
 - [When To Use It](#when-to-use-it)
 - [Architecture](#architecture)
 - [Install](#install)
+- [Quick Start](#quick-start)
 - [Core Primitives](#core-primitives)
+- [Adoption Path](#adoption-path)
 - [Examples](#examples)
 - [Documentation](#documentation)
 - [Development](#development)
@@ -80,6 +82,59 @@ cd clients/python
 pip install -e .
 ```
+## Quick Start
+Wrap the real product loop first. Do not build a toy eval path that users never
+exercise.
+```ts
+import {
+  objectiveEval,
+  runAgentControlLoop,
+} from '@tangle-network/agent-eval'
+const result = await runAgentControlLoop({
+  intent: task.prompt,
+  budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
+  async observe() {
+    return productAdapter.readState(task.id)
+  },
+  async validate({ state }) {
+    return [
+      objectiveEval({
+        id: 'build-passes',
+        passed: state.build.exitCode === 0,
+        severity: 'critical',
+        metadata: state.build,
+      }),
+      objectiveEval({
+        id: 'preview-serves',
+        passed: state.preview.httpStatus === 200,
+        severity: 'critical',
+      }),
+    ]
+  },
+  async decide({ evals }) {
+    return evals.every((evalResult) => evalResult.passed)
+      ? { type: 'stop', reason: 'all critical checks passed' }
+      : { type: 'continue', action: { type: 'repair' }, reason: 'checks failed' }
+  },
+  async act(action) {
+    return productAdapter.runAgentStep(task.id, action)
+  },
+})
+await productAdapter.storeControlResult(task.id, result)
+```
+Once this loop represents production behavior, convert completed runs into
+feedback trajectories, split them into train/dev/test/holdout sets, and run
+multi-shot optimization against the same adapter.
 ## Core Primitives
 | Primitive | Purpose |
@@ -101,6 +156,24 @@ pip install -e .
 `NoopResearcher` is a fail-loud sentinel for wiring tests. Production systems
 should implement `Researcher` directly or use `CallbackResearcher`.
+## Adoption Path
+1. Choose one real workflow: code generation, browser task, research task,
+   workflow builder, voice interaction, or domain agent task.
+2. Write a product adapter that can observe state and execute one agent step.
+3. Add deterministic validators first: build, test, serve, schema, policy,
+   permission, retrieval, and deployment checks.
+4. Add LLM judges only for subjective quality that deterministic checks cannot
+   measure.
+5. Emit traces and convert successful and failed attempts into
+   `FeedbackTrajectory` records.
+6. Build train/dev/test/holdout scenarios from those trajectories.
+7. Run `runMultiShotOptimization()` or prompt/code evolution on train/dev.
+8. Promote only when test/holdout gates pass and real product telemetry improves.
+For a complete product integration guide, see
+[Product Eval Adoption](./docs/product-eval-adoption.md).
 ## Examples
 Runnable examples live in the repository's
@@ -121,6 +194,7 @@ tested, and copied without turning this page into a tutorial.
 - [Concepts](./docs/concepts.md)
 - [Feature Guide](./docs/feature-guide.md)
+- [Product Eval Adoption](./docs/product-eval-adoption.md)
 - [Control Runtime](./docs/control-runtime.md)
 - [Knowledge Readiness](./docs/knowledge-readiness.md)
 - [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
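
Reviewer note (not part of the package diff): step 6 of the Adoption Path asks for train/dev/test/holdout scenarios built from stored trajectories. The sketch below shows one way to make that assignment deterministic by hashing a stable id. Everything in it is an assumption for illustration: `TrajectoryRef`, `assignSplit`, `splitTrajectories`, and the 60/15/15/10 proportions are hypothetical and are not taken from `@tangle-network/agent-eval`.

```typescript
type Split = 'train' | 'dev' | 'test' | 'holdout'

// Hypothetical record shape; the package's FeedbackTrajectory type is not shown in this diff.
interface TrajectoryRef {
  id: string
}

// FNV-1a 32-bit hash, so the assignment is stable across runs and machines.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

// Assumed proportions: 60% train, 15% dev, 15% test, 10% holdout.
function assignSplit(id: string): Split {
  const bucket = fnv1a(id) % 100
  if (bucket < 60) return 'train'
  if (bucket < 75) return 'dev'
  if (bucket < 90) return 'test'
  return 'holdout'
}

// Groups items by split; each item lands in exactly one split.
function splitTrajectories<T extends TrajectoryRef>(items: T[]): Record<Split, T[]> {
  const out: Record<Split, T[]> = { train: [], dev: [], test: [], holdout: [] }
  for (const item of items) {
    out[assignSplit(item.id)].push(item)
  }
  return out
}
```

Because the split depends only on the id, a trajectory stays in the same split when new runs are added later, which helps keep holdout examples out of the optimizer's training data.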

package/dist/openapi.json CHANGED

@@ -2,7 +2,7 @@
   "openapi": "3.1.0",
   "info": {
     "title": "@tangle-network/agent-eval — wire protocol",
-    "version": "0.20.10",
+    "version": "0.20.11",
     "description": "HTTP and stdio RPC interface to agent-eval. The TypeScript runtime is the source of truth; this spec is the contract that cross-language clients (Python, Rust, Go) generate from.\n\nWire-protocol version: 1.0.0. Bumps on breaking changes to request/response schemas.",
     "contact": {
       "name": "Tangle Network",

package/docs/product-eval-adoption.md ADDED

@@ -0,0 +1,194 @@
+# Product Eval Adoption
+This guide is for teams adding `@tangle-network/agent-eval` to a real agent
+product. The package supplies evaluation contracts and runtime primitives. Your
+product supplies the actual workflow adapter, state, credentials, tools, UI, and
+storage.
+## Goal
+Use the same loop for production, replay, and optimization:
+```txt
+real user task
+  -> product adapter observes state
+  -> validators and judges grade state
+  -> control loop decides next action
+  -> product agent acts in the real environment
+  -> trace + feedback trajectory are stored
+  -> datasets and optimizers replay the same adapter
+```
+If production and eval use different loops, benchmark gains will not transfer.
+## What The Product Owns
+The product owns:
+- task state and domain models
+- credentials, tenant policy, approval, and side-effect rules
+- browser, sandbox, CLI, connector, or voice drivers
+- database and trace persistence
+- user/reviewer feedback collection
+- deployment and live canary routing
+- model gateway configuration
+`agent-eval` owns:
+- trace, run, dataset, feedback, and score contracts
+- control-loop mechanics
+- verifier and judge orchestration
+- failure taxonomy
+- paired statistics and holdout gates
+- optimizer inputs and promotion reports
+## Minimal Production Adapter
+Start with a small adapter that mirrors one real workflow.
+```ts
+interface ProductEvalAdapter<TState, TAction> {
+  observe(taskId: string): Promise<TState>
+  validate(state: TState): Promise<ControlEvalResult[]>
+  decide(input: {
+    state: TState
+    evals: ControlEvalResult[]
+    history: unknown[]
+  }): Promise<TAction | 'stop'>
+  act(taskId: string, action: TAction): Promise<void>
+}
+```
+Keep the adapter product-owned until at least two products need the same shape.
+## Validator Order
+Use deterministic checks before judges.
+1. **State validity**: schema, required files, required DB rows, required
+   connections.
+2. **Runtime gates**: install, build, typecheck, tests, serve, deploy smoke.
+3. **Policy gates**: approvals, side effects, budget, credentials, data
+   freshness.
+4. **Behavior gates**: browser flows, API calls, generated app preview, voice
+   transcript checks.
+5. **Semantic judges**: intent fit, quality, completeness, safety,
+   professional correctness.
+Semantic judges should never turn a failed build into a pass.
+## Traces And Feedback
+Every serious run should record:
+- task id and scenario id
+- git commit
+- model and provider
+- prompt/config hashes
+- tool calls and retrieval spans
+- build/test/deploy output
+- cost, latency, and token use
+- user/reviewer feedback
+- final outcome and failure class
+Convert runs into `FeedbackTrajectory` records so normal product usage becomes
+replayable eval data.
+```txt
+production run -> feedback trajectory -> dataset scenario -> optimizer row
+```
+## Datasets And Holdouts
+Use four splits:
+- `train`: optimizer search.
+- `dev`: tuning and threshold selection.
+- `test`: normal reporting.
+- `holdout`: promotion-only gate.
+Do not inspect or tune against holdout failures during optimization. If a
+holdout failure reveals a real product bug, fix the bug and rotate the holdout
+set with a signed note.
+## Optimization
+Use `runMultiShotOptimization()` when the system is a multi-step agent, not a
+single prompt.
+Good optimization targets:
+- system prompt
+- tool descriptions
+- retrieval policy
+- data acquisition policy
+- user-question policy
+- evaluator threshold
+- agent topology
+- scaffold/template choice
+Bad optimization targets:
+- hidden holdout examples
+- production credentials
+- brittle string checks that do not match user value
+- fake workflows that do not call the product adapter
+Use actionable side information so the optimizer knows whether a failure belongs
+to prompt, tools, retrieval, data acquisition, sandbox, evaluator, or product
+runtime.
+## Release Gate
+A launch or promotion should require:
+- enough runs for the target risk level
+- paired improvement over the current baseline
+- no critical regression on test
+- holdout pass or explicit rejection
+- cost and latency within budget
+- no unresolved canary or contamination failures
+- trace evidence for representative successes and failures
+- human-readable report with failure clusters and next actions
+`evaluateReleaseConfidence()` and the paired statistics helpers provide the
+decision data. The product decides the business threshold.
+## Product Patterns
+### Coding Or Builder Agent
+Use sandbox/build/test/serve/browser validators. Add intent and semantic
+concept judges only after the generated app runs.
+### Browser Agent
+Record browser steps, screenshots, network errors, console errors, and final
+state. Use deterministic DOM/API assertions before visual or semantic judges.
+### Domain Agent
+Use domain fixtures, jurisdiction/date metadata, retrieval spans, and
+professional judges. Fail missing/stale evidence separately from bad reasoning.
+### Workflow Or Integration Agent
+Use `@tangle-network/agent-integrations` manifests as readiness inputs. Gate
+missing connections, missing scopes, approval-required writes, and stale tokens
+before blaming the agent prompt.
+### Voice Agent
+Record transcript, timing, interruptions, tool calls, and task outcome. Judge
+conversation quality separately from tool success and policy compliance.
+## Anti-Patterns
+- Evaluating only final prose for an agent that actually builds, browses, or
+  calls tools.
+- Letting an LLM judge override failed tests.
+- Optimizing on examples that users will never hit.
+- Recording traces as logs but never converting them to datasets.
+- Calling every failure a prompt failure when context, data, auth, or runtime
+  readiness was missing.
+- Shipping reports without run ids, commits, model ids, or evidence links.
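
Reviewer note (not part of the package diff): the added guide orders validators from state validity through semantic judges and states that a judge should never turn a failed build into a pass. A minimal sketch of that rule follows, as a pure function over already-collected results. The names `Tier`, `CheckResult`, `Verdict`, and `combineVerdict` are hypothetical; the package's `ControlEvalResult` type and its actual orchestration are not shown in this diff and may differ.

```typescript
// Tiers mirror the "Validator Order" list in the guide.
type Tier = 'state' | 'runtime' | 'policy' | 'behavior' | 'semantic'

const TIER_ORDER: Tier[] = ['state', 'runtime', 'policy', 'behavior', 'semantic']

// Hypothetical result shape for this sketch.
interface CheckResult {
  id: string
  tier: Tier
  passed: boolean
}

interface Verdict {
  passed: boolean
  blockedBy: string[] // ids of the failing checks, earliest tier first
  judgesConsidered: boolean // false when a deterministic failure made judges irrelevant
}

function combineVerdict(results: CheckResult[]): Verdict {
  const byTier = (a: CheckResult, b: CheckResult) =>
    TIER_ORDER.indexOf(a.tier) - TIER_ORDER.indexOf(b.tier)

  // Any deterministic failure decides the outcome; judge results are ignored.
  const failedDeterministic = results
    .filter((r) => r.tier !== 'semantic' && !r.passed)
    .sort(byTier)
  if (failedDeterministic.length > 0) {
    return {
      passed: false,
      blockedBy: failedDeterministic.map((r) => r.id),
      judgesConsidered: false,
    }
  }

  // Judges can only veto a run that already passed every deterministic check.
  const failedJudges = results.filter((r) => r.tier === 'semantic' && !r.passed)
  return {
    passed: failedJudges.length === 0,
    blockedBy: failedJudges.map((r) => r.id),
    judgesConsidered: true,
  }
}
```

In this shape a passing judge score has no path to override a failed build or test, while a failing judge can still block a run whose deterministic checks all passed.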

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.20.10",
+  "version": "0.20.11",
   "description": "Trace-first evaluation infrastructure for agent systems: traces, harnesses, verifier pipelines, judges, datasets, gates, optimization, and reporting.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {