npm - @tangle-network/agent-eval - Versions diffs - 0.24.0 → 0.25.0 - Mend

@tangle-network/agent-eval 0.24.0 → 0.25.0

Files changed (27)

package/CHANGELOG.md +65 -0
package/README.md +71 -0
package/dist/{chunk-SY6WAAAD.js → chunk-5LBB5B3Z.js} +296 -5
package/dist/chunk-5LBB5B3Z.js.map +1 -0
package/dist/{chunk-VRJVTXRV.js → chunk-EDUKQ5AM.js} +85 -85
package/dist/{chunk-VRJVTXRV.js.map → chunk-EDUKQ5AM.js.map} +1 -1
package/dist/{chunk-OHEPNJQN.js → chunk-JLZQWFV3.js} +65 -1
package/dist/chunk-JLZQWFV3.js.map +1 -0
package/dist/cli.js +1 -1
package/dist/index.d.ts +311 -11
package/dist/index.js +695 -2
package/dist/index.js.map +1 -1
package/dist/openapi.json +491 -1
package/dist/optimization.d.ts +2 -2
package/dist/optimization.js +1 -1
package/dist/pipelines/index.js +3 -67
package/dist/pipelines/index.js.map +1 -1
package/dist/{release-report-TDPn1cxq.d.ts → release-report-BNgMdqPF.d.ts} +1 -1
package/dist/reporting.d.ts +2 -2
package/dist/{researcher-CUOiGcGv.d.ts → researcher-BPT8x_NT.d.ts} +1 -1
package/dist/rl.d.ts +3 -3
package/dist/{summary-report-BXGs_9V0.d.ts → summary-report-C7VPYEj2.d.ts} +1 -1
package/dist/wire/index.d.ts +347 -3
package/dist/wire/index.js +19 -1
package/package.json +1 -1
package/dist/chunk-OHEPNJQN.js.map +0 -1
package/dist/chunk-SY6WAAAD.js.map +0 -1

package/CHANGELOG.md CHANGED

@@ -1,5 +1,70 @@
 # Changelog
+## 0.25.0 — ProductionLoop primitive: close the eval → prod → eval cycle
+This release ships the **orchestration layer** that turns the existing
+eval substrate into a continuously-improving production system. Static
+prompts decay; today's regulation flips tomorrow. The pieces to close
+the loop were already in the package (`runMultiShotOptimization`,
+`failureClusterView`, `evaluateReleaseConfidence`, `extractPreferences`,
+`FeedbackTrajectoryStore`, `TraceStore`); this release adds the one
+clean primitive that wires them together end-to-end.
+### Added
+- **`runProductionLoop({ ... })`** (`src/production-loop.ts`,
+  `@experimental`) — one call = one cycle. Ingests production traces
+  and feedback, clusters failures, runs evolve against the worst
+  cluster, gates with `HeldOutGate` + `evaluateReleaseConfidence`
+  (fail-closed), and — when wired with an `AutoPrClient` — opens a PR
+  with the improved prompt. Idempotent + replayable: same `runId`
+  yields the same plan. Cron / GitHub Actions are the consumer's job;
+  the primitive doesn't own scheduling.
+- **`proposeAutomatedPullRequest(client, input)`** + two transports
+  (`src/auto-pr.ts`, `@experimental`):
+    - `httpGithubClient({ token, ... })` — direct REST against
+      `api.github.com`, no extra deps. Idempotent on branch name:
+      existing open PRs are returned, not duplicated.
+    - `ghCliClient({ ... })` — shells out to `gh` for environments
+      where developer auth state is already configured.
+  Both validate inputs (no `..` paths, no whitespace branches, no
+  duplicate file changes) and surface `ValidationError` / `ConfigError`
+  from the typed taxonomy.
+- **`POST /v1/feedback` + `POST /v1/traces/ingest`** wire endpoints
+  (`src/wire/`). Both Zod-validated, both append to the configured
+  store (`FeedbackTrajectoryStore` / `TraceStore`). 503 when no store
+  is wired (fail loud, not silent). Traces ingest accepts both
+  `application/json` (`{events:[...]}`) and `application/x-ndjson` for
+  streaming production runtimes. Schemas (`TraceEvent`,
+  `FeedbackTrajectory`, `TracesIngestRequest/Response`,
+  `FeedbackIngestResponse`) added to `openapi.json` for cross-language
+  clients.
+- **Optional bearer-token auth** on the wire server, configured via
+  `createApp({ auth: { bearer: '...' } })` or as a verifier function
+  for rotating tokens. `/healthz` and `/v1/version` remain unprotected
+  (regression: never lock monitoring out of the runtime).
+- **`examples/production-loop/`** — synthetic end-to-end demo wiring
+  the loop against in-memory trace + feedback stores and a fake
+  auto-PR client. Shows the failure-cluster trigger, the evolve round,
+  the gate verdict, and the PR-shaped output without requiring
+  credentials or a live model.
+### Changed
+- **Wire server** (`createApp(opts)`) now accepts optional
+  `IngestionStores` (`{ traceStore?, feedbackStore? }`) and `auth`.
+  Existing zero-arg callers continue to work — judge / rubrics /
+  version / healthz are unchanged.
+### Status tags
+- Every new export is `@experimental` initially. Pin the patch version
+  if you depend on it. All other 0.24.0 stability tags are preserved.
 ## 0.24.0 — DX cleanup: framing, stability tags, lint, taxonomy, strict indices
 This release is **DX + correctness**. No production behavior moved; consumer
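
Note on the `application/x-ndjson` ingestion path in the 0.25.0 entry: the sketch below shows one way a runtime might serialize events for it. `toNdjson` and `parseNdjson` are hypothetical helpers, not package exports. `parseNdjson` mirrors the split/trim/filter/parse steps of the `/v1/traces/ingest` route handler that appears later in this diff, so the pair can serve as a local round-trip check.

```ts
// Hypothetical client-side helpers; not part of @tangle-network/agent-eval.
type TraceEventLike = Record<string, unknown>;

/** Serialize events as NDJSON: one compact JSON object per line. */
function toNdjson(events: TraceEventLike[]): string {
  // JSON.stringify escapes embedded newlines, so each event stays on one line.
  return events.map((e) => JSON.stringify(e)).join("\n") + "\n";
}

/** Same steps as the server route in this diff: split, trim, drop blanks, parse. */
function parseNdjson(text: string): unknown[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}
```

Because blank lines are filtered out on the server side, a trailing newline after the last event is harmless.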

package/README.md CHANGED

@@ -88,6 +88,75 @@ await product.storeEvalResult(task.id, result)
 Same loop shape in production, replay, benchmark, and optimization. Swap the
 dependencies behind `observe()` and `act()`, never the eval contract.
+## Production loop — close the eval → prod → eval cycle (0.25.0)
+Static prompts decay. Yesterday's FTC rule flips today; yesterday's tool quirk
+becomes today's incident. The production agents that win are the ones that
+**continuously re-train against live failure modes**.
+`runProductionLoop` is the orchestration layer that wires the existing eval
+substrate into a self-improvement cron:
+```ts
+import {
+  runProductionLoop,
+  httpGithubClient,
+  FileSystemFeedbackTrajectoryStore,
+} from '@tangle-network/agent-eval'
+import { FileSystemTraceStore } from '@tangle-network/agent-eval/traces'
+const result = await runProductionLoop({
+  runId: `weekly-${new Date().toISOString().slice(0, 10)}`,
+  target: 'tax-agent',
+  // 1. Where production traces + feedback land. Wire the HTTP ingestion
+  //    endpoints (POST /v1/traces/ingest, POST /v1/feedback) from your
+  //    runtime; the same store reads them here.
+  traceStore: new FileSystemTraceStore({ dir: 'data/prod-traces' }),
+  feedbackStore: new FileSystemFeedbackTrajectoryStore({ dir: 'data/prod-feedback' }),
+  // 2. Cluster threshold: act on failure groups ≥ 20 runs or ≥ 5% of corpus.
+  cluster: { minClusterSize: 20, minSeverityRatio: 0.05, maxClustersPerCycle: 1 },
+  // 3. Evolve: seed = current prompt, gate against holdout scenarios.
+  evolve: {
+    baselinePrompt: currentSystemPrompt,
+    holdoutScenarios: productionShapeScenarios,
+    runner,                            // your agent driver
+    scorer,                            // calibrated judge or rubric
+    mutator,                           // GEPA-style or addendum-style mutator
+    gate: {
+      baselineKey: 'baseline',
+      minProductiveRuns: 5,
+      pairedDeltaThreshold: 0.03,      // require Nσ improvement on holdout
+      overfitGapThreshold: 0.10,
+    },
+  },
+  // 4. Ship: when the gate passes, open a PR with the new prompt.
+  ship: {
+    client: httpGithubClient({ token: process.env.GITHUB_TOKEN! }),
+    repo: { owner: 'tangle-network', name: 'tax-agent' },
+    branchPrefix: 'eval/auto-improve',
+    promptFilePath: 'prompts/tax-agent-system.txt',
+    reviewers: ['drew'],
+  },
+  cron: { cadence: 'weekly' },         // surface-only; consumer schedules
+})
+console.log(result.decision)            // 'pr_opened' | 'gate_failed' | 'no_actionable_failures' | ...
+console.log(result.pullRequest?.prUrl)  // populated when a PR was opened
+```
+The primitive runs **one cycle**. Schedule it with `workflow_dispatch` + cron in
+GitHub Actions. It is **idempotent + replayable**: same `runId` → same plan.
+Gate failures are fail-closed — a candidate that beats baseline on search but
+overfits on holdout never lands.
+Full runnable demo (synthetic traces, no credentials) in
+[`examples/production-loop`](./examples/production-loop/README.md).
 ## Self-improvement loop
 Eval doesn't end at "pass/fail." Outcomes become training signal, mutation
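
Note on the `cluster` option in the README example: its comment reads "act on failure groups ≥ 20 runs or ≥ 5% of corpus". The clustering code itself is not part of this diff, so the following is only a sketch of that comment under an "either threshold suffices" reading. `isActionableCluster` and `ClusterThresholds` are hypothetical names, and the package's real logic may differ.

```ts
// Hypothetical reading of the README comment; not the package's implementation.
interface ClusterThresholds {
  minClusterSize: number;   // absolute number of failing runs
  minSeverityRatio: number; // share of the whole corpus, 0..1
}

function isActionableCluster(
  clusterSize: number,
  corpusSize: number,
  t: ClusterThresholds,
): boolean {
  if (corpusSize <= 0) return false; // nothing ingested, nothing to act on
  const ratio = clusterSize / corpusSize;
  // Assumption: either threshold is enough, per the "or" in the README comment.
  return clusterSize >= t.minClusterSize || ratio >= t.minSeverityRatio;
}
```

With the README values (`minClusterSize: 20`, `minSeverityRatio: 0.05`), a 10-run cluster in a 100-run corpus would qualify on ratio alone, while the same cluster in a 1,000-run corpus would not.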
@@ -222,6 +291,8 @@ and runtime. See [`examples/`](./examples/).
   closed loop — score, reflect, mutate, re-score, repeat.
 - [`examples/fine-tune-with-prime-rl`](./examples/fine-tune-with-prime-rl/README.md):
   RunRecord → preferences → trainer (prime-rl) → next campaign.
+- [`examples/production-loop`](./examples/production-loop/README.md):
+  ingest prod traces + feedback, cluster failures, evolve, gate, open a PR.
 ## Docs

package/dist/{chunk-SY6WAAAD.js → chunk-5LBB5B3Z.js} RENAMED

@@ -74,6 +74,114 @@ var HealthResponseSchema = z.object({
   status: z.literal("ok"),
   uptimeSec: z.number()
 }).openapi("HealthResponse");
+var TraceEventSchema = z.object({
+  eventId: z.string().min(1).describe("Stable id for the event. Use ULID or UUID."),
+  runId: z.string().min(1).describe("Run this event belongs to."),
+  spanId: z.string().optional().describe("Span that emitted the event, if any."),
+  kind: z.enum([
+    "log",
+    "error",
+    "budget_decrement",
+    "budget_breach",
+    "state_mutation",
+    "policy_violation",
+    "redaction_applied",
+    "custom"
+  ]).describe("Coarse event category \u2014 matches the TraceSchema v1 EventKind enum."),
+  timestamp: z.number().int().nonnegative().describe("Unix millis. Must be monotonically non-decreasing within a span."),
+  payload: z.record(z.string(), z.unknown()).describe("Free-form payload \u2014 the runtime owns the shape.")
+}).openapi("TraceEvent");
+var TracesIngestRequestSchema = z.object({
+  events: z.array(TraceEventSchema).min(1).max(1e4).describe("Batch of events. Max 10k per call \u2014 bigger streams should be chunked.")
+}).openapi("TracesIngestRequest");
+var TracesIngestResponseSchema = z.object({
+  accepted: z.number().int().nonnegative().describe("Number of events persisted."),
+  rejected: z.number().int().nonnegative().describe("Number of events the store refused \u2014 see `errors[]` for reasons."),
+  errors: z.array(
+    z.object({
+      eventId: z.string().describe("Event id this error applies to."),
+      message: z.string().describe("Why the event was rejected.")
+    })
+  ).default([])
+}).openapi("TracesIngestResponse");
+var FeedbackLabelSchema = z.object({
+  id: z.string().optional(),
+  source: z.enum(["user", "judge", "environment", "metric", "policy", "system"]),
+  kind: z.enum([
+    "approve",
+    "reject",
+    "select",
+    "edit",
+    "rank",
+    "rate",
+    "comment",
+    "metric_outcome",
+    "policy_block",
+    "revision_request"
+  ]),
+  value: z.unknown(),
+  reason: z.string().optional(),
+  severity: z.enum(["info", "warning", "error", "critical"]).optional(),
+  createdAt: z.string().describe("ISO-8601 UTC."),
+  metadata: z.record(z.string(), z.unknown()).optional()
+}).openapi("FeedbackLabel");
+var FeedbackAttemptSchema = z.object({
+  id: z.string().min(1),
+  stepIndex: z.number().int().nonnegative(),
+  artifactType: z.enum([
+    "text",
+    "code",
+    "plan",
+    "research",
+    "action",
+    "ui",
+    "decision",
+    "data",
+    "other"
+  ]),
+  artifact: z.unknown(),
+  options: z.array(z.unknown()).optional(),
+  proposedAction: z.object({
+    type: z.string(),
+    risk: z.enum(["low", "medium", "high"]).optional(),
+    costUsd: z.number().optional(),
+    externalSideEffect: z.boolean().optional(),
+    requiresApproval: z.boolean().optional(),
+    metadata: z.record(z.string(), z.unknown()).optional()
+  }).optional(),
+  feedback: z.array(FeedbackLabelSchema).optional(),
+  createdAt: z.string(),
+  metadata: z.record(z.string(), z.unknown()).optional()
+}).openapi("FeedbackAttempt");
+var FeedbackTrajectorySchema = z.object({
+  id: z.string().min(1).describe("Stable id; idempotency key for the trajectory."),
+  projectId: z.string().optional(),
+  scenarioId: z.string().optional(),
+  task: z.object({
+    intent: z.string().min(1),
+    context: z.unknown().optional()
+  }),
+  attempts: z.array(FeedbackAttemptSchema).default([]),
+  labels: z.array(FeedbackLabelSchema).default([]),
+  outcome: z.object({
+    success: z.boolean().optional(),
+    score: z.number().optional(),
+    metrics: z.record(z.string(), z.number()).optional(),
+    costUsd: z.number().optional(),
+    detail: z.string().optional(),
+    observedAt: z.string().optional(),
+    metadata: z.record(z.string(), z.unknown()).optional()
+  }).optional(),
+  split: z.enum(["train", "dev", "test", "holdout"]).optional(),
+  tags: z.record(z.string(), z.string()).optional(),
+  createdAt: z.string().describe("ISO-8601 UTC."),
+  updatedAt: z.string().optional(),
+  metadata: z.record(z.string(), z.unknown()).optional()
+}).openapi("FeedbackTrajectory");
+var FeedbackIngestResponseSchema = z.object({
+  id: z.string().describe("Trajectory id that was persisted."),
+  persisted: z.boolean().describe("True when the trajectory was saved (idempotent on id).")
+}).openapi("FeedbackIngestResponse");
 var ErrorResponseSchema = z.object({
   error: z.object({
     code: z.string().describe(
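
Note on `TracesIngestRequestSchema`: the schema caps a batch at 10,000 events and requires at least one, and its description says bigger streams should be chunked. A minimal client-side sketch of that chunking follows. The `TraceEvent` interface is hand-written here to mirror the fields in the hunk above, and `chunkEvents` is a hypothetical helper, not a package export.

```ts
// Hand-written mirror of the TraceEvent fields shown in the hunk above.
type TraceEventKind =
  | "log" | "error" | "budget_decrement" | "budget_breach"
  | "state_mutation" | "policy_violation" | "redaction_applied" | "custom";

interface TraceEvent {
  eventId: string;
  runId: string;
  spanId?: string;
  kind: TraceEventKind;
  timestamp: number; // Unix millis, non-negative integer
  payload: Record<string, unknown>;
}

const MAX_EVENTS_PER_CALL = 10_000; // from `.max(1e4)` in TracesIngestRequestSchema

/** Split events into batches that each fit within one ingest call. */
function chunkEvents(
  events: TraceEvent[],
  size: number = MAX_EVENTS_PER_CALL,
): TraceEvent[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: TraceEvent[][] = [];
  for (let i = 0; i < events.length; i += size) {
    batches.push(events.slice(i, i + size));
  }
  return batches; // an empty input yields no batches, so no empty request is sent
}
```

Returning zero batches for an empty input matters because the schema's `.min(1)` would reject an empty `events` array.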
@@ -378,9 +486,43 @@ function handleVersion() {
     package: "@tangle-network/agent-eval",
     version: readPackageVersion(),
     wireVersion: WIRE_VERSION,
-    apiSurface: ["judge", "listRubrics", "version"]
+    apiSurface: ["judge", "listRubrics", "version", "feedback.ingest", "traces.ingest"]
   };
 }
+async function handleTracesIngest(req, stores) {
+  if (!stores.traceStore) {
+    throw new WireError(
+      "service_unavailable",
+      "No trace store configured on this server. Pass `traceStore` to `createApp`.",
+      503
+    );
+  }
+  const errors = [];
+  let accepted = 0;
+  for (const event of req.events) {
+    try {
+      await stores.traceStore.appendEvent(event);
+      accepted++;
+    } catch (err) {
+      errors.push({
+        eventId: event.eventId,
+        message: err instanceof Error ? err.message : String(err)
+      });
+    }
+  }
+  return { accepted, rejected: errors.length, errors };
+}
+async function handleFeedbackIngest(req, stores) {
+  if (!stores.feedbackStore) {
+    throw new WireError(
+      "service_unavailable",
+      "No feedback store configured on this server. Pass `feedbackStore` to `createApp`.",
+      503
+    );
+  }
+  await stores.feedbackStore.save(req);
+  return { id: req.id, persisted: true };
+}
 // src/wire/openapi.ts
 import { OpenAPIRegistry, OpenApiGeneratorV31 } from "@asteasolutions/zod-to-openapi";
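
Note on `handleTracesIngest`: the handler catches per-event store failures and reports them in `errors[]` rather than failing the whole batch, so a response can carry a non-zero `rejected` count alongside accepted events. The sketch below reproduces that loop against a hypothetical in-memory store (`DedupingStore`, which refuses repeated event ids) to make the partial-acceptance behavior concrete. Only the `appendEvent` method name comes from the diff; everything else is illustrative.

```ts
interface EventLike { eventId: string }
interface AppendOnlyStore { appendEvent(event: EventLike): Promise<void> }

// Hypothetical store for illustration: rejects an event id it has already seen.
class DedupingStore implements AppendOnlyStore {
  private seen = new Set<string>();
  async appendEvent(event: EventLike): Promise<void> {
    if (this.seen.has(event.eventId)) {
      throw new Error(`duplicate eventId: ${event.eventId}`);
    }
    this.seen.add(event.eventId);
  }
}

// Same shape as the loop in handleTracesIngest: one failure does not abort the batch.
async function ingest(events: EventLike[], store: AppendOnlyStore) {
  const errors: { eventId: string; message: string }[] = [];
  let accepted = 0;
  for (const event of events) {
    try {
      await store.appendEvent(event);
      accepted++;
    } catch (err) {
      errors.push({
        eventId: event.eventId,
        message: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return { accepted, rejected: errors.length, errors };
}
```

A client that only checks the HTTP status would miss these rejections, so inspecting `rejected` and `errors` on each response is the safer habit.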
@@ -392,6 +534,10 @@ function buildOpenApi(packageVersion) {
   registry.register("VersionResponse", VersionResponseSchema);
   registry.register("HealthResponse", HealthResponseSchema);
   registry.register("ErrorResponse", ErrorResponseSchema);
+  registry.register("TracesIngestRequest", TracesIngestRequestSchema);
+  registry.register("TracesIngestResponse", TracesIngestResponseSchema);
+  registry.register("FeedbackTrajectory", FeedbackTrajectorySchema);
+  registry.register("FeedbackIngestResponse", FeedbackIngestResponseSchema);
   registry.registerPath({
     method: "post",
     path: "/v1/judge",
@@ -458,6 +604,69 @@ function buildOpenApi(packageVersion) {
       }
     }
   });
+  registry.registerPath({
+    method: "post",
+    path: "/v1/traces/ingest",
+    summary: "Ingest a batch of production TraceEvents",
+    description: "Append a batch of TraceEvents to the configured TraceStore. Accepts application/json ({events:[...]}) or application/x-ndjson (one event per line). Returns counts of accepted + rejected events.",
+    request: {
+      body: {
+        content: {
+          "application/json": { schema: TracesIngestRequestSchema },
+          "application/x-ndjson": { schema: TracesIngestRequestSchema }
+        }
+      }
+    },
+    responses: {
+      200: {
+        description: "Ingestion summary",
+        content: { "application/json": { schema: TracesIngestResponseSchema } }
+      },
+      400: {
+        description: "Validation error",
+        content: { "application/json": { schema: ErrorResponseSchema } }
+      },
+      401: {
+        description: "Unauthorized (when bearer auth is configured)",
+        content: { "application/json": { schema: ErrorResponseSchema } }
+      },
+      503: {
+        description: "No trace store configured",
+        content: { "application/json": { schema: ErrorResponseSchema } }
+      }
+    }
+  });
+  registry.registerPath({
+    method: "post",
+    path: "/v1/feedback",
+    summary: "Ingest a FeedbackTrajectory from production",
+    description: "Persist a single FeedbackTrajectory. Idempotent on trajectory.id \u2014 re-posting replaces the prior record. Used by production runtimes to forward user \u{1F44D}/\u{1F44E}/edits into the eval substrate.",
+    request: {
+      body: {
+        content: {
+          "application/json": { schema: FeedbackTrajectorySchema }
+        }
+      }
+    },
+    responses: {
+      200: {
+        description: "Persisted",
+        content: { "application/json": { schema: FeedbackIngestResponseSchema } }
+      },
+      400: {
+        description: "Validation error",
+        content: { "application/json": { schema: ErrorResponseSchema } }
+      },
+      401: {
+        description: "Unauthorized (when bearer auth is configured)",
+        content: { "application/json": { schema: ErrorResponseSchema } }
+      },
+      503: {
+        description: "No feedback store configured",
+        content: { "application/json": { schema: ErrorResponseSchema } }
+      }
+    }
+  });
   const generator = new OpenApiGeneratorV31(registry.definitions);
   const doc = generator.generateDocument({
     openapi: "3.1.0",
@@ -608,14 +817,34 @@ import { serve } from "@hono/node-server";
 import { Hono } from "hono";
 import { cors } from "hono/cors";
 var STARTED_AT = Date.now();
-function createApp() {
+var AUTH_EXEMPT_PATHS = /* @__PURE__ */ new Set(["/healthz", "/v1/version", "/openapi.json"]);
+function createApp(opts = {}) {
   const app = new Hono();
   app.use("*", cors());
+  if (opts.auth) {
+    const verify = opts.auth.bearer;
+    app.use("*", async (c, next) => {
+      const path = new URL(c.req.url).pathname;
+      if (AUTH_EXEMPT_PATHS.has(path)) return next();
+      const raw = c.req.header("authorization") ?? "";
+      const match = raw.match(/^Bearer\s+(.+)$/i);
+      if (!match) {
+        throw new WireError("unauthorized", "Missing or malformed Authorization header.", 401);
+      }
+      const token = match[1];
+      const ok = typeof verify === "string" ? token === verify : await verify(token);
+      if (!ok) {
+        throw new WireError("unauthorized", "Invalid bearer token.", 401);
+      }
+      return next();
+    });
+  }
   app.onError((err, c) => {
     if (err instanceof WireError) {
+      const status = err.status;
       return c.json(
         { error: { code: err.code, message: err.message, details: err.details } },
-        err.status
+        status
       );
     }
     console.error("[agent-eval] unhandled error:", err);
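
Note on the bearer middleware: `opts.auth.bearer` is either a string compared with `===` or a function awaited with the token, and the exempt set in this chunk also includes `/openapi.json` alongside the two paths the changelog names. The sketch below isolates the header parsing (same regular expression as the hunk) and shows one possible verifier for rotating tokens. `extractBearer` and `rotatingVerifier` are hypothetical names, not package exports.

```ts
/** Same pattern as the middleware: case-insensitive scheme, one or more spaces. */
function extractBearer(header: string | undefined): string | null {
  const match = (header ?? "").match(/^Bearer\s+(.+)$/i);
  return match ? (match[1] ?? null) : null;
}

// Hypothetical rotating verifier: accepts any token in the currently active set,
// e.g. the new token plus the previous one during a rollover window.
function rotatingVerifier(getActiveTokens: () => ReadonlySet<string>) {
  return async (token: string): Promise<boolean> => getActiveTokens().has(token);
}
```

Under the branch shown in the hunk, a function like the one returned by `rotatingVerifier` would be passed as `auth.bearer` in place of a fixed string.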
@@ -644,11 +873,64 @@ function createApp() {
     const result = await handleJudge(parsed.data);
     return c.json(result);
   });
+  app.post("/v1/traces/ingest", async (c) => {
+    const contentType = c.req.header("content-type") ?? "";
+    let payload;
+    if (contentType.includes("application/x-ndjson")) {
+      const text = await c.req.text();
+      const events = text.split("\n").map((line) => line.trim()).filter((line) => line.length > 0).map((line) => {
+        try {
+          return JSON.parse(line);
+        } catch {
+          throw new WireError(
+            "validation_error",
+            "NDJSON line did not parse as JSON.",
+            400,
+            line.slice(0, 200)
+          );
+        }
+      });
+      payload = { events };
+    } else {
+      payload = await c.req.json().catch(() => null);
+    }
+    if (payload == null) {
+      throw new WireError("validation_error", "Request body must be JSON or NDJSON.", 400);
+    }
+    const parsed = TracesIngestRequestSchema.safeParse(payload);
+    if (!parsed.success) {
+      throw new WireError(
+        "validation_error",
+        "Request did not match TracesIngestRequest schema.",
+        400,
+        parsed.error.issues
+      );
+    }
+    const result = await handleTracesIngest(parsed.data, opts.stores ?? {});
+    return c.json(result);
+  });
+  app.post("/v1/feedback", async (c) => {
+    const raw = await c.req.json().catch(() => null);
+    if (raw == null) {
+      throw new WireError("validation_error", "Request body must be JSON.", 400);
+    }
+    const parsed = FeedbackTrajectorySchema.safeParse(raw);
+    if (!parsed.success) {
+      throw new WireError(
+        "validation_error",
+        "Request did not match FeedbackTrajectory schema.",
+        400,
+        parsed.error.issues
+      );
+    }
+    const result = await handleFeedbackIngest(parsed.data, opts.stores ?? {});
+    return c.json(result);
+  });
   app.get("/openapi.json", (c) => c.json(buildOpenApi(handleVersion().version)));
   return app;
 }
 function startServer(opts = {}) {
-  const app = createApp();
+  const app = createApp(opts);
   const port = opts.port ?? 5005;
   const host = opts.host ?? "127.0.0.1";
   return serve({ fetch: app.fetch, port, hostname: host }, ({ address, port: actualPort }) => {
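
Note on the `/v1/feedback` route: it validates the body against `FeedbackTrajectorySchema`, whose only required fields are `id`, `task.intent`, and `createdAt` (`attempts` and `labels` default to empty arrays). The sketch below assembles a minimal request for it. `buildFeedbackRequest` and `MinimalFeedbackTrajectory` are hypothetical names, and the base URL in the usage note simply reuses the default host and port from `startServer`.

```ts
// Hand-written subset of FeedbackTrajectory: just the required fields plus labels.
interface MinimalFeedbackTrajectory {
  id: string;               // idempotency key per the schema description
  task: { intent: string };
  createdAt: string;        // ISO-8601 UTC
  labels?: Array<{
    source: "user";
    kind: "approve" | "reject";
    value: unknown;
    createdAt: string;
  }>;
}

/** Build the URL and init object for a POST to /v1/feedback (hypothetical helper). */
function buildFeedbackRequest(
  baseUrl: string,
  trajectory: MinimalFeedbackTrajectory,
  bearer?: string,
) {
  const headers: Record<string, string> = { "content-type": "application/json" };
  if (bearer) headers["authorization"] = `Bearer ${bearer}`;
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/feedback`,
    init: { method: "POST", headers, body: JSON.stringify(trajectory) },
  };
}
```

The result can be passed straight to `fetch(req.url, req.init)`, for example with `baseUrl` set to `http://127.0.0.1:5005`. Since the OpenAPI description states that re-posting the same `id` replaces the prior record, retrying a failed request with an unchanged trajectory should be safe.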
@@ -666,6 +948,13 @@ export {
   ListRubricsResponseSchema,
   VersionResponseSchema,
   HealthResponseSchema,
+  TraceEventSchema,
+  TracesIngestRequestSchema,
+  TracesIngestResponseSchema,
+  FeedbackLabelSchema,
+  FeedbackAttemptSchema,
+  FeedbackTrajectorySchema,
+  FeedbackIngestResponseSchema,
   ErrorResponseSchema,
   WIRE_VERSION,
   hashRubric,
@@ -676,6 +965,8 @@ export {
   handleJudge,
   handleListRubrics,
   handleVersion,
+  handleTracesIngest,
+  handleFeedbackIngest,
   buildOpenApi,
   dispatchRpc,
   runRpcOnce,
@@ -683,4 +974,4 @@ export {
   createApp,
   startServer
 };
-//# sourceMappingURL=chunk-SY6WAAAD.js.map
+//# sourceMappingURL=chunk-5LBB5B3Z.js.map